# TermRecord Linkage

This task aims to link ConceptRecords across different documents, which helps to disambiguate historical data. In the Encyclopaedia Britannica, for example, a term with a given name can have several meanings (concepts), i.e. polysemy. When studying how language changes, it is useful to link and compare the terms that share a concept across years.

Inputs: a basic knowledge graph (term records) and a dataframe with term information, URIs, and embeddings.

Steps:

1. List all distinct term names in the dataframe.
2. For each term name: collect all the term records with that name, list their distinct years, and sort the years.
   1. For each year in the list and each term in that year: if no concept URI is linked to the term, create a new one. Then find the most similar term with the same name in each following year, and link it to the same concept URI if it has none yet. Repeat until there are no further years.
3. Add the concept URIs to the knowledge graph, and link each concept to its term records.
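The steps above can be sketched on toy data. This is a simplified, dependency-free illustration rather than the notebook's implementation: the records, the two-dimensional embeddings, and the integer concept ids are all hypothetical, and the reciprocal check and concept-replacement fallback used later in the notebook are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def link_concepts(records, threshold=0.7):
    """records: list of dicts with 'year' and 'embedding', all sharing ONE term name.
    Returns a list of concept ids parallel to records."""
    concept = [None] * len(records)
    years = sorted({r["year"] for r in records})
    next_id = 0
    for pos, year in enumerate(years):
        for i, rec in enumerate(records):
            if rec["year"] != year:
                continue
            # create a new concept if this term is not linked yet
            if concept[i] is None:
                next_id += 1
                concept[i] = next_id
            # find the most similar term in each following year
            for later in years[pos + 1:]:
                candidates = [j for j, r in enumerate(records) if r["year"] == later]
                best = max(candidates,
                           key=lambda j: cosine(rec["embedding"], records[j]["embedding"]))
                score = cosine(rec["embedding"], records[best]["embedding"])
                # link only if similar enough and not linked already
                if score > threshold and concept[best] is None:
                    concept[best] = concept[i]
    return concept

# Hypothetical records: two meanings of one term name across three years.
records = [
    {"year": 1797, "embedding": [1.0, 0.0]},  # meaning A
    {"year": 1797, "embedding": [0.0, 1.0]},  # meaning B
    {"year": 1810, "embedding": [0.9, 0.1]},  # meaning A
    {"year": 1810, "embedding": [0.1, 0.9]},  # meaning B
    {"year": 1823, "embedding": [1.0, 0.2]},  # meaning A
]
concepts = link_concepts(records)
print(concepts)
```

In this toy case the three "meaning A" records end up with one concept id and the two "meaning B" records with another.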

In [1]:
import math
# Load eb kg dataframe
import pandas as pd
eb_kg_df = pd.read_json("../eb_kg_hq_with_embeddings_dataframe", orient="index")
eb_kg_df

Unnamed: 0,edition_uri,vol_num,vol_title,genre,print_location,year_published,edition_num,term_uri,note,description,description_uri,summary,term_name,term_type,start_page_num,end_page_num,alter_names,reference_terms,supplements_to,embedding
0,https://w3id.org/hto/Edition/9922270543804340,11,"Fifth edition, Volume 11, HYD-LIE",encyclopedia,Edinburgh,1815,5.0,https://w3id.org/hto/ArticleTermRecord/9922270...,,"in antiquity, a kind of mournful song, used up...",https://w3id.org/hto/OriginalDescription/99222...,,JALEMUS,Article,32,32,[],[],[],"[0.0667808577, -0.1191483587, 0.0258066095, -0..."
1,https://w3id.org/hto/Edition/997902543804341,7,"Third edition, Volume 7, ETM-GOA",encyclopedia,Edinburgh,1797,3.0,https://w3id.org/hto/ArticleTermRecord/9979025...,,"among miners, signifies a piece of earth it wh...",https://w3id.org/hto/OriginalDescription/99790...,,GLEBE,Article,862,862,[],[],[],"[-0.0055958461, -0.0707782134, 0.0171438232, 0..."
2,https://w3id.org/hto/Edition/9929777383804340,6,"Eighth edition, Volume 6, Burning glasses-Climate",encyclopedia,Edinburgh,1853,8.0,https://w3id.org/hto/ArticleTermRecord/9929777...,,"Pietro, the Roman school, who was by Giotto, w...",https://w3id.org/hto/OriginalDescription/99297...,,CAVALLINI,Article,354,354,[],[],[],"[-0.0121824667, 0.0402460583, 0.00665577360000..."
3,https://w3id.org/hto/Edition/997902523804341,1,"Second edition, Volume 1, A-AST",encyclopedia,Edinburgh,1778,2.0,https://w3id.org/hto/ArticleTermRecord/9979025...,,"in antiquity, a denomination given to the sena...",https://w3id.org/hto/OriginalDescription/99790...,,JEINATTE,Article,117,117,[],[],[],"[0.0032457884, -0.0179126002, -0.0030581676000..."
4,https://w3id.org/hto/Edition/9910796233804340,8,"Fourth edition, Volume 8 Part 1, ELE-FAI",encyclopedia,Edinburgh,1810,4.0,https://w3id.org/hto/ArticleTermRecord/9910796...,,"a French term, sometimes life authors to denot...",https://w3id.org/hto/OriginalDescription/99107...,,ESCORT,Article,351,351,[],[],[],"[0.0191970598, -0.026992561300000002, 0.029633..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
177578,https://w3id.org/hto/Edition/9910796373804340,4,"Supplement to the fourth, fifth and sixth edit...",encyclopedia,Edinburgh,1824,,https://w3id.org/hto/TopicTermRecord/991079637...,,Enwology.and those with sheathed wings; he obs...,https://w3id.org/hto/OriginalDescription/99107...,The modifications in the form of the antennas ...,ENTOMOLOGY,Topic,267,291,[],[],"[6, 4, 5]","[0.0345440656, 0.0038321146, 0.0129572209, -0...."
177579,https://w3id.org/hto/Edition/9910796273804340,8,"Seventh edition, Volume 8, DIA-England",encyclopedia,Edinburgh,1842,7.0,https://w3id.org/hto/TopicTermRecord/991079627...,,"> EGYPT. 459 vpt. to illustrate the history, l...",https://w3id.org/hto/OriginalDescription/99107...,and Age of the Monarch determined.—His Charact...,DCTT,Topic,469,469,[],[],[],"[0.0283674542, 0.0216944125, -0.0273680091, -0..."
177580,https://w3id.org/hto/Edition/9910796233804340,20,"Fourth edition, Volume 20 Part 1, SUI-THE",encyclopedia,Edinburgh,1810,4.0,https://w3id.org/hto/TopicTermRecord/991079623...,,^Conftruc^ a=: S7r~ and we go at > after divid...,https://w3id.org/hto/OriginalDescription/99107...,^Conftruc^ a=: S7r~ and we go at > after divid...,TRICON,Topic,91,92,[],[{'uri': 'https://w3id.org/hto/ArticleTermReco...,[],"[-0.017611675, -0.1025323346, -0.0331101865, 0..."
177581,https://w3id.org/hto/Edition/997902543804341,14,"Third edition, Volume 14, PAS-PLA",encyclopedia,Edinburgh,1797,3.0,https://w3id.org/hto/TopicTermRecord/997902543...,,"PIN of pine.trecj, which in common languages w...",https://w3id.org/hto/OriginalDescription/99790...,"PIN of pine.trecj, which in common languages w...",PIONEERS,Topic,795,800,[],[],[],"[-0.007840358700000001, 0.0051432806, -0.00761..."


## Group terms into concepts

In [2]:
# Get all distinct term names
term_names = eb_kg_df["term_name"].unique()
term_names

array(['JALEMUS', 'GLEBE', 'CAVALLINI', ..., 'LATV', 'GERG', 'DCTT'],
      dtype=object)

In [41]:
eb_kg_df["concept_uri"] = None

In [4]:
from sklearn.metrics.pairwise import cosine_similarity
def get_similar_terms_grouped_by_years_sorted_by_score(df):
    """
    Compute the pairwise cosine similarities between the embeddings in the dataframe. For each term, group all terms (including the term itself) by the year the record was published, with each group sorted by descending score.
    :param df: input dataframe, which contains the embedding and the year_published of each term
    :return: {
        df_index: {
            1771: [
                {
                    "index": <dataframe index of the similar term>,
                    "score": <cosine similarity score>
                }, ...
            ],
            1815: [
            ],....
        },
        ......
    }
    """
    embeddings = df["embedding"].values.tolist()
    indices = df.index
    similarities = cosine_similarity(embeddings, embeddings)
    # argsort is ascending; each row is reversed below so the most similar terms come first
    similarities_sorted = similarities.argsort()
    result = {}
    for i in range(len(similarities_sorted)):
        for j in similarities_sorted[i][::-1]:
            year = df.loc[indices[j], "year_published"]
            sim_info = {
                    "index": indices[j],
                    "score": similarities[i][j]
                }
            # create the per-term dict and per-year list on first use; appends keep descending-score order
            result.setdefault(indices[i], {}).setdefault(year, []).append(sim_info)

    return result
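To make the returned structure concrete, the sketch below builds the same kind of nested dictionary for three hypothetical records. It is a dependency-free re-implementation for illustration only: `rank_by_year`, the record indices, and the two-dimensional embeddings are assumptions, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_by_year(records):
    """records: {index: (year, embedding)}.
    Returns {index: {year: [{'index': ..., 'score': ...}, ...]}},
    each list sorted by descending score."""
    result = {}
    for i, (_, emb_i) in records.items():
        by_year = {}
        for j, (year_j, emb_j) in records.items():
            hit = {"index": j, "score": cosine(emb_i, emb_j)}
            by_year.setdefault(year_j, []).append(hit)
        for hits in by_year.values():
            hits.sort(key=lambda h: h["score"], reverse=True)
        result[i] = by_year
    return result

# Hypothetical records: one term from 1797 and two from 1810.
records = {
    10: (1797, [1.0, 0.0]),
    11: (1810, [0.8, 0.6]),
    12: (1810, [0.0, 1.0]),
}
ranked = rank_by_year(records)
# Best match for record 10 among the 1810 records:
print(ranked[10][1810][0])
```

Note that each term also appears in the list for its own year with a score of 1, which is harmless here because the linking step only looks at following years.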

In [42]:

for term_name in term_names[0:10]:
    # all terms with term_name
    terms_df = eb_kg_df[eb_kg_df["term_name"] == term_name]
    years = terms_df["year_published"].unique().tolist()
    years.sort()
    print(years)
    concept_count = 0
    similarities = get_similar_terms_grouped_by_years_sorted_by_score(terms_df)
    # generate id for the concept
    concept_id = str(terms_df["term_uri"].values.tolist()[0]).split("/")[-1].split("_")[-2]
    for year_index in range(len(years)):
        year = years[year_index]
        print(f"processing terms {term_name} in {year} ")
        # all terms with term_name and the year
        terms_year_df = terms_df[terms_df["year_published"] == year]
        for index, row in terms_year_df.iterrows():
            print(f"linking term with index {index} with other terms across years")
            #print(f"description of the term: {row['description']}")
            concept_uri = eb_kg_df.loc[index, "concept_uri"]
            if concept_uri is None:
                concept_count += 1
                concept_uri = "https://w3id.org/hto/Concept/" + str(concept_id) + "_" + str(concept_count)
                eb_kg_df.loc[index, "concept_uri"] = concept_uri

            # find most similar terms in each following years
            similarity_threshold = 0.7
            for f_year_index in years[year_index + 1:]:
                most_similar_term = similarities[index][f_year_index][0]
                score = most_similar_term["score"]
                similar_term_index = most_similar_term["index"]
                # skip if a concept uri is already linked to the candidate, if the score is below the threshold, or if another term in the current year is more similar to the candidate
                if score > similarity_threshold:
                    if eb_kg_df.loc[similar_term_index, "concept_uri"] is None:
                        if similarities[similar_term_index][year][0]["index"] == index:
                            eb_kg_df.loc[similar_term_index, "concept_uri"] = concept_uri
                            print(f"year: {f_year_index}, term {most_similar_term} is linked")
                        else:
                            print(f"year: {f_year_index}, term {most_similar_term} is skipped, because term {similarities[similar_term_index][year][0]['index']} is more similar to the most_similar_term")
                    else:
                        if eb_kg_df.loc[similar_term_index, "concept_uri"] == concept_uri:
                            print("same concept uri")
                            print(f"year: {f_year_index}, term {most_similar_term} is skipped, because it is linked already")
                        else:
                            # Fallback when the link can't be made directly: adopt the concept uri already assigned to the similar term, e.g. escort: 1797-1815-1823, 1810-1815-1823
                            concept_uri = eb_kg_df.loc[similar_term_index, "concept_uri"]
                            eb_kg_df.loc[index, "concept_uri"] = concept_uri
                            print(f"replace the concept uri to the one same with {most_similar_term}")
                else:
                    print(f"year: {f_year_index}, term {most_similar_term} is skipped, because it is not quite similar")

[1797, 1810, 1815, 1823, 1842]
processing terms JALEMUS in 1797 
linking term with index 75433 with other terms across years
year: 1810, term {'index': 169876, 'score': 0.8221958914473327} is linked
year: 1815, term {'index': 0, 'score': 0.8772943130567592} is linked
year: 1823, term {'index': 143547, 'score': 0.7765352175988391} is linked
year: 1842, term {'index': 99448, 'score': 0.8602911416885433} is linked
processing terms JALEMUS in 1810 
linking term with index 169876 with other terms across years
same concept uri
year: 1815, term {'index': 0, 'score': 0.847553290689463} is skipped, because it is linked already
same concept uri
year: 1823, term {'index': 143547, 'score': 0.7524103150375007} is skipped, because it is linked already
same concept uri
year: 1842, term {'index': 99448, 'score': 0.831713609751644} is skipped, because it is linked already
processing terms JALEMUS in 1815 
linking term with index 0 with other terms across years
same concept uri
year: 1823, term {'index'
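The reciprocal check in the linking loop above can be isolated as follows: a candidate in a later year is accepted only if the current term is, in turn, the candidate's best match in the current year. This is a minimal sketch; the helper name `reciprocal_best_match`, the string indices, and the hand-written scores are hypothetical.

```python
def reciprocal_best_match(ranked, index, year, later_year, threshold=0.7):
    """Return the index of the best match in later_year if the score is above
    the threshold and the match is mutual; otherwise return None."""
    best = ranked[index][later_year][0]
    if best["score"] <= threshold:
        return None
    # best match of the candidate, looking back at the current year
    back = ranked[best["index"]][year][0]
    return best["index"] if back["index"] == index else None

# Hypothetical scores: terms "a" and "b" (1797) both point at "c" (1810),
# but "c" is closer to "a".
ranked = {
    "a": {1810: [{"index": "c", "score": 0.90}]},
    "b": {1810: [{"index": "c", "score": 0.75}]},
    "c": {1797: [{"index": "a", "score": 0.90}, {"index": "b", "score": 0.75}]},
}
match_a = reciprocal_best_match(ranked, "a", 1797, 1810)
match_b = reciprocal_best_match(ranked, "b", 1797, 1810)
print(match_a, match_b)
```

Here only "a" is linked to "c"; "b" clears the threshold but is skipped because "c" prefers "a", which prevents two distinct meanings from the same year from collapsing into one concept.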

In [43]:
eb_kg_df

Unnamed: 0,edition_uri,vol_num,vol_title,genre,print_location,year_published,edition_num,term_uri,note,description,...,summary,term_name,term_type,start_page_num,end_page_num,alter_names,reference_terms,supplements_to,embedding,concept_uri
0,https://w3id.org/hto/Edition/9922270543804340,11,"Fifth edition, Volume 11, HYD-LIE",encyclopedia,Edinburgh,1815,5.0,https://w3id.org/hto/ArticleTermRecord/9922270...,,"in antiquity, a kind of mournful song, used up...",...,,JALEMUS,Article,32,32,[],[],[],"[0.0667808577, -0.1191483587, 0.0258066095, -0...",https://w3id.org/hto/Concept/1475160240_1
1,https://w3id.org/hto/Edition/997902543804341,7,"Third edition, Volume 7, ETM-GOA",encyclopedia,Edinburgh,1797,3.0,https://w3id.org/hto/ArticleTermRecord/9979025...,,"among miners, signifies a piece of earth it wh...",...,,GLEBE,Article,862,862,[],[],[],"[-0.0055958461, -0.0707782134, 0.0171438232, 0...",https://w3id.org/hto/Concept/2743363218_2
2,https://w3id.org/hto/Edition/9929777383804340,6,"Eighth edition, Volume 6, Burning glasses-Climate",encyclopedia,Edinburgh,1853,8.0,https://w3id.org/hto/ArticleTermRecord/9929777...,,"Pietro, the Roman school, who was by Giotto, w...",...,,CAVALLINI,Article,354,354,[],[],[],"[-0.0121824667, 0.0402460583, 0.00665577360000...",https://w3id.org/hto/Concept/8493321351_1
3,https://w3id.org/hto/Edition/997902523804341,1,"Second edition, Volume 1, A-AST",encyclopedia,Edinburgh,1778,2.0,https://w3id.org/hto/ArticleTermRecord/9979025...,,"in antiquity, a denomination given to the sena...",...,,JEINATTE,Article,117,117,[],[],[],"[0.0032457884, -0.0179126002, -0.0030581676000...",https://w3id.org/hto/Concept/2241879145_1
4,https://w3id.org/hto/Edition/9910796233804340,8,"Fourth edition, Volume 8 Part 1, ELE-FAI",encyclopedia,Edinburgh,1810,4.0,https://w3id.org/hto/ArticleTermRecord/9910796...,,"a French term, sometimes life authors to denot...",...,,ESCORT,Article,351,351,[],[],[],"[0.0191970598, -0.026992561300000002, 0.029633...",https://w3id.org/hto/Concept/4160462161_1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
177578,https://w3id.org/hto/Edition/9910796373804340,4,"Supplement to the fourth, fifth and sixth edit...",encyclopedia,Edinburgh,1824,,https://w3id.org/hto/TopicTermRecord/991079637...,,Enwology.and those with sheathed wings; he obs...,...,The modifications in the form of the antennas ...,ENTOMOLOGY,Topic,267,291,[],[],"[6, 4, 5]","[0.0345440656, 0.0038321146, 0.0129572209, -0....",
177579,https://w3id.org/hto/Edition/9910796273804340,8,"Seventh edition, Volume 8, DIA-England",encyclopedia,Edinburgh,1842,7.0,https://w3id.org/hto/TopicTermRecord/991079627...,,"> EGYPT. 459 vpt. to illustrate the history, l...",...,and Age of the Monarch determined.—His Charact...,DCTT,Topic,469,469,[],[],[],"[0.0283674542, 0.0216944125, -0.0273680091, -0...",
177580,https://w3id.org/hto/Edition/9910796233804340,20,"Fourth edition, Volume 20 Part 1, SUI-THE",encyclopedia,Edinburgh,1810,4.0,https://w3id.org/hto/TopicTermRecord/991079623...,,^Conftruc^ a=: S7r~ and we go at > after divid...,...,^Conftruc^ a=: S7r~ and we go at > after divid...,TRICON,Topic,91,92,[],[{'uri': 'https://w3id.org/hto/ArticleTermReco...,[],"[-0.017611675, -0.1025323346, -0.0331101865, 0...",
177581,https://w3id.org/hto/Edition/997902543804341,14,"Third edition, Volume 14, PAS-PLA",encyclopedia,Edinburgh,1797,3.0,https://w3id.org/hto/TopicTermRecord/997902543...,,"PIN of pine.trecj, which in common languages w...",...,"PIN of pine.trecj, which in common languages w...",PIONEERS,Topic,795,800,[],[],[],"[-0.007840358700000001, 0.0051432806, -0.00761...",


## Add concepts, and link concepts and terms in knowledge graph

In [1]:
# load the graph
from rdflib import Graph, Namespace, RDF

# Load the existing graph
graph = Graph()
graph.parse(location="../hto.ttl", format="turtle")

hto = Namespace("https://w3id.org/hto#")
len(graph)

382

In [10]:
import pandas as pd
eb_kg_df_with_concept_uris = pd.read_json("../eb_kg_hq_normalised_embeddings_concepts_dataframe", orient="index")
eb_kg_df_with_concept_uris

Unnamed: 0,edition_uri,vol_num,vol_title,genre,print_location,year_published,edition_num,term_uri,note,description,...,summary,term_name,term_type,start_page_num,end_page_num,alter_names,reference_terms,supplements_to,embedding,concept_uri
0,https://w3id.org/hto/Edition/9922270543804340,11,"Fifth edition, Volume 11, HYD-LIE",encyclopedia,Edinburgh,1815,5.0,https://w3id.org/hto/ArticleTermRecord/9922270...,,"in antiquity, a kind of mournful song, used up...",...,,JALEMUS,Article,32,32,[],[],[],"[0.069103919, -0.0869860128, 0.015464490300000...",https://w3id.org/hto/Concept/1475160240_1
1,https://w3id.org/hto/Edition/997902543804341,7,"Third edition, Volume 7, ETM-GOA",encyclopedia,Edinburgh,1797,3.0,https://w3id.org/hto/ArticleTermRecord/9979025...,,"among miners, signifies a piece of earth it wh...",...,,GLEBE,Article,862,862,[],[],[],"[-0.0092582917, -0.0393608212, 0.0152903702, 0...",https://w3id.org/hto/Concept/2743363218_2
2,https://w3id.org/hto/Edition/9929777383804340,6,"Eighth edition, Volume 6, Burning glasses-Climate",encyclopedia,Edinburgh,1853,8.0,https://w3id.org/hto/ArticleTermRecord/9929777...,,"Pietro, the Roman school, who was by Giotto, w...",...,,CAVALLINI,Article,354,354,[],[],[],"[0.0009889967, 0.045553144100000005, -0.000591...",https://w3id.org/hto/Concept/8493321351_1
3,https://w3id.org/hto/Edition/997902523804341,1,"Second edition, Volume 1, A-AST",encyclopedia,Edinburgh,1778,2.0,https://w3id.org/hto/ArticleTermRecord/9979025...,,"in antiquity, a denomination given to the sena...",...,,JEINATTE,Article,117,117,[],[],[],"[-0.014175045300000001, 0.0373287126, -0.01205...",https://w3id.org/hto/Concept/2241879145_1
4,https://w3id.org/hto/Edition/9910796233804340,8,"Fourth edition, Volume 8 Part 1, ELE-FAI",encyclopedia,Edinburgh,1810,4.0,https://w3id.org/hto/ArticleTermRecord/9910796...,,"a French term, sometimes life authors to denot...",...,,ESCORT,Article,351,351,[],[],[],"[0.0222478937, -0.0197300091, 0.0230092779, 0....",https://w3id.org/hto/Concept/4160462161_1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
177578,https://w3id.org/hto/Edition/9910796373804340,4,"Supplement to the fourth, fifth and sixth edit...",encyclopedia,Edinburgh,1824,,https://w3id.org/hto/TopicTermRecord/991079637...,,Enwology.and those with sheathed wings; he obs...,...,The modifications in the form of the antennas ...,ENTOMOLOGY,Topic,267,291,[],[],"[6, 4, 5]","[0.0316150486, -0.0176730026, 0.0031530608, -0...",https://w3id.org/hto/Concept/115254768_3
177579,https://w3id.org/hto/Edition/9910796273804340,8,"Seventh edition, Volume 8, DIA-England",encyclopedia,Edinburgh,1842,7.0,https://w3id.org/hto/TopicTermRecord/991079627...,,"> EGYPT. 459 vpt. to illustrate the history, l...",...,and Age of the Monarch determined.—His Charact...,DCTT,Topic,469,469,[],[],[],"[0.031784635, 0.0405623764, -0.0190234669, -0....",https://w3id.org/hto/Concept/8357791245_1
177580,https://w3id.org/hto/Edition/9910796233804340,20,"Fourth edition, Volume 20 Part 1, SUI-THE",encyclopedia,Edinburgh,1810,4.0,https://w3id.org/hto/TopicTermRecord/991079623...,,^Conftruc^ a=: S7r~ and we go at > after divid...,...,^Conftruc^ a=: S7r~ and we go at > after divid...,TRICON,Topic,91,92,[],[{'uri': 'https://w3id.org/hto/ArticleTermReco...,[],"[-0.0279026795, -0.0963665769, -0.0473303795, ...",https://w3id.org/hto/Concept/9057981969_3
177581,https://w3id.org/hto/Edition/997902543804341,14,"Third edition, Volume 14, PAS-PLA",encyclopedia,Edinburgh,1797,3.0,https://w3id.org/hto/TopicTermRecord/997902543...,,"PIN of pine.trecj, which in common languages w...",...,"PIN of pine.trecj, which in common languages w...",PIONEERS,Topic,795,800,[],[],[],"[0.0154414643, 0.0240709074, -0.0121394219, 0....",https://w3id.org/hto/Concept/8562170582_2


In [11]:
concept_uris = eb_kg_df_with_concept_uris["concept_uri"].unique()
concept_uris

array(['https://w3id.org/hto/Concept/1475160240_1',
       'https://w3id.org/hto/Concept/2743363218_2',
       'https://w3id.org/hto/Concept/8493321351_1', ...,
       'https://w3id.org/hto/Concept/3183165528_1',
       'https://w3id.org/hto/Concept/8357791245_1',
       'https://w3id.org/hto/Concept/9057981969_3'], dtype=object)

In [12]:
from rdflib import URIRef
from tqdm import tqdm

for concept_uri in tqdm(concept_uris):
    if concept_uri:
        concept_terms_df = eb_kg_df_with_concept_uris[eb_kg_df_with_concept_uris["concept_uri"] == concept_uri]
        concept_uriref = URIRef(concept_uri)
        graph.add((concept_uriref, RDF.type, hto.Concept))
        for index, row in concept_terms_df.iterrows():
            term_uri_ref = URIRef(row["term_uri"])
            graph.add((concept_uriref, hto.hadConceptRecord, term_uri_ref))

    else:
        print("None")


100%|██████████| 88504/88504 [15:26<00:00, 95.49it/s] 
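The loop above filters the whole dataframe once per concept, which the progress bar shows taking about 15 minutes for 88504 concepts. One possible alternative is to group the term URIs by concept in a single pass and then emit the triples. The sketch below shows the idea in plain Python with hypothetical `example.org` URIs and a hypothetical helper `group_terms_by_concept`; with pandas, the same grouping could likely be obtained from `DataFrame.groupby("concept_uri")`.

```python
from collections import defaultdict

def group_terms_by_concept(rows):
    """rows: iterable of (term_uri, concept_uri) pairs.
    Returns {concept_uri: [term_uri, ...]}, skipping rows without a concept."""
    grouped = defaultdict(list)
    for term_uri, concept_uri in rows:
        if concept_uri:
            grouped[concept_uri].append(term_uri)
    return dict(grouped)

# Hypothetical (term_uri, concept_uri) pairs.
rows = [
    ("https://example.org/term/1", "https://example.org/concept/A"),
    ("https://example.org/term/2", "https://example.org/concept/B"),
    ("https://example.org/term/3", "https://example.org/concept/A"),
    ("https://example.org/term/4", None),
]
grouped = group_terms_by_concept(rows)

# Each pair below corresponds to one (concept, hadConceptRecord, term) triple.
for concept_uri, term_uris in grouped.items():
    for term_uri in term_uris:
        print(concept_uri, "->", term_uri)
```

Each concept is visited once and each row is read once, so the cost grows with the number of rows rather than with rows times concepts.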


In [13]:
graph.serialize(format="turtle", destination="../results/extra_concepts_records_link.ttl")

<Graph identifier=N1974a4f60ffb4f5594114b9111592381 (<class 'rdflib.graph.Graph'>)>