# Single Source EB Dataframe to RDF

This notebook creates RDF triples from Encyclopaedia Britannica (EB) dataframes based on the [Heritage Text Ontology](https://github.com/frances-ai/HeritageTextOntology) and exports them as a Turtle (`.ttl`) file. It only accepts dataframes whose text content was extracted from a single source for each edition. In other words, each EB term has exactly one original description.

## Load and check dataframe

Each entry in a dataframe should have the following columns (the values shown are from one entry of the first edition):

- MMSID:
- editor:                                                  Smellie, William
- editor_date:                                                   1740-1795
- genre:                                                       encyclopedia
- language:                                                             eng
- termsOfAddress:                                                       NaN
- physicalDescription:               3 v., 160 plates : ill. ; 26 cm. (4to)
- place:                                                         Edinburgh
- publisher:              Printed for A. Bell and C. Macfarquhar; and so...
- referencedBy:           [Alston, R.C.  Engl. language III, 560, ESTC T...
- shelfLocator:                                                        EB.1
- editionSubTitle:        Illustrated with one hundred and sixty copperp...
- volumeTitle:            Encyclopaedia Britannica; or, A dictionary of ...
- year:                                                                1771
- volumeId:                                                       144133901
- permanentURL:                            https://digital.nls.uk/144133901
- publisherPersons:                     [C. Macfarquhar, Colin Macfarquhar]
- volumeNum:                                                              1
- letters:                                                              A-B
- part:                                                                   0
- editionNum:                                                             1
- supplementTitle:
- supplementSubTitle:
- supplementsTo:                                                         []
- term:                                                                  OR
- definition:             A NEW A D I C T I A A, the name of several riv...
- reference_terms:                                                          []
- header:                                           EncyclopaediaBritannica
- startsAt:                                                              15
- endsAt:                                                                15
- position:                                                           0
- termType:                                                         Article
- filePath:                                  144133901/alto/188082904.34.xml
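
Before any conversion, it can help to confirm that a dataframe actually carries these columns. The sketch below is not part of the original notebook: `REQUIRED_COLUMNS` is an assumed subset of the list above, and `missing_columns` is a hypothetical helper that accepts any iterable of column names, such as `df.columns`.

```python
# Hypothetical helper: report which expected columns are absent.
# REQUIRED_COLUMNS is an assumed subset of the column list above.
REQUIRED_COLUMNS = [
    "MMSID", "editionNum", "year", "place", "volumeId", "volumeNum",
    "term", "definition", "startsAt", "endsAt", "position",
    "termType", "filePath",
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names not found in `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Example with a plain list; with pandas, pass `df.columns` instead.
example_columns = ["MMSID", "term", "definition", "year"]
print(missing_columns(example_columns))
```

The NLS dataframes name some of these columns differently (`typeTerm`, `positionPage`, `altoXML`) until they are renamed, so a check like this is only meaningful after that rename.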

In [26]:
import re

import pandas as pd

# Nineteenth-Century Knowledge Project Dataframe
df = pd.read_json('../../source_dataframes/eb/nckp_final_eb_7_dataframe_clean_Damon', orient="index")
df = df.fillna(0)

len(df)

23970

In [27]:
edition_mmsids = df["MMSID"].unique()
print(edition_mmsids)

[9910796273804340]


In [28]:
df.rename(columns={'editionTitle': 'volumeTitle', 'volumeTitle': 'editionTitle'}, inplace=True)

In [29]:
df_edition = df[df["MMSID"] == edition_mmsids[0]].reset_index(drop=True)
df_edition

Unnamed: 0,term,note,alter_names,reference_terms,definition,startsAt,endsAt,position,termType,filePath,...,volumeId,permanentURL,publisherPersons,volumeNum,editionNum,numberOfVolumes,numberOfTerms,supplementTitle,supplementSubTitle,supplementsTo
0,A,0,[],[],The first letter of the alphabet in every know...,11,12,1,Article,./eb07_TXT_v2/a2/kp-eb0702-000101-9822-v2.txt,...,192984259,https://digital.nls.uk/192984259,[],2,7,22,0.0,,,[]
1,A,0,[],[],"as an abbreviation, is likewise of frequent oc...",12,12,2,Article,./eb07_TXT_v2/a2/kp-eb0702-000101-9822-v2.txt,...,192984259,https://digital.nls.uk/192984259,[],2,7,22,0.0,,,[]
2,AA,0,[],[],"a river of the province of Groningen, in the k...",12,12,3,Article,./eb07_TXT_v2/a2/kp-eb0702-000201-9835-v2.txt,...,192984259,https://digital.nls.uk/192984259,[],2,7,22,0.0,,,[]
3,AA,0,[],[],a river in the province of Overyssel. in the N...,12,12,4,Article,./eb07_TXT_v2/a2/kp-eb0702-000201-9835-v2.txt,...,192984259,https://digital.nls.uk/192984259,[],2,7,22,0.0,,,[]
4,AA,0,[],[],"a river of the province of Antwerp, in the Net...",12,12,5,Article,./eb07_TXT_v2/a2/kp-eb0702-000201-9835-v2.txt,...,192984259,https://digital.nls.uk/192984259,[],2,7,22,0.0,,,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23965,ZWENIGORODKA,0,[],[],a circle of the Russian government of Kiew. It...,1037,1037,4,Article,./eb07_TXT_v2/z21/kp-eb0721-102704-1077-v2.txt,...,193819045,https://digital.nls.uk/193819045,[],21,7,22,0.0,,,[]
23966,ZWICKAU,0,[],[],"a city of the kingdom of Saxony, the capital o...",1037,1037,5,Article,./eb07_TXT_v2/z21/kp-eb0721-102705-1077-v2.txt,...,193819045,https://digital.nls.uk/193819045,[],21,7,22,0.0,,,[]
23967,ZWOLLE,0,[],[],"a city, the capital of the circle of the same ...",1037,1037,6,Article,./eb07_TXT_v2/z21/kp-eb0721-102706-1077-v2.txt,...,193819045,https://digital.nls.uk/193819045,[],21,7,22,0.0,,,[]
23968,ZYGHUR,0,[],[],"a town of Hindustan, in the province of Bejapo...",1037,1037,7,Article,./eb07_TXT_v2/z21/kp-eb0721-102707-1077-v2.txt,...,193819045,https://digital.nls.uk/193819045,[],21,7,22,0.0,,,[]


In [30]:
df_earth = df[df["term"] == "EARTH"].reset_index(drop=True)
df_earth

Unnamed: 0,term,note,alter_names,reference_terms,definition,startsAt,endsAt,position,termType,filePath,...,volumeId,permanentURL,publisherPersons,volumeNum,editionNum,numberOfVolumes,numberOfTerms,supplementTitle,supplementSubTitle,supplementsTo
0,EARTH,0,[],[],"amongst ancient philosophers, owe of the four ...",401,401,13,Article,./eb07_TXT_v2/e8/kp-eb0708-039107-8218-v2.txt,...,193322688,https://digital.nls.uk/193322688,[],8,7,22,0.0,,,[]
1,EARTH,0,[],[FIGURE OF THE EARTH],"in Astronomy and Geography, one of the primary...",401,401,14,Article,./eb07_TXT_v2/e8/kp-eb0708-039107-8218-v2.txt,...,193322688,https://digital.nls.uk/193322688,[],8,7,22,0.0,,,[]


## Load the HTO ontology and import data from the NLS dataframes into it

In [31]:
from rdflib import Graph, URIRef, Namespace

# Create a new RDFLib Graph
graph = Graph()

# Load your ontology file into the graph
ontology_file = "../hto.ttl"
graph.parse(ontology_file, format="turtle")
hto = Namespace("https://w3id.org/hto#")


In [32]:
# Print the number of "triples" in the Graph
print(f"Graph g has {len(graph)} statements.")

Graph g has 536 statements.


In [33]:
# load metadata
import pandas as pd
metadata_df = pd.read_json("../source_dataframes/eb/nls_metadata_dataframe", orient="index")

In [34]:
from datetime import datetime
from rdflib import Literal, XSD, RDF, RDFS
from rdflib.namespace import FOAF, PROV, SDO
def create_collection():
    collection = URIRef("https://w3id.org/hto/WorkCollection/EncyclopaediaBritannica")
    graph.add((collection, RDF.type, hto.WorkCollection))
    graph.add((collection, hto.name, Literal("Encyclopaedia Britannica Collection", datatype=XSD.string)))
    return collection

In [35]:
collection = create_collection()

In [36]:
import regex

# raw string so that \p is passed to the regex engine instead of being treated as a string escape
NON_AZ09_REGEXP = regex.compile(r'[^\p{L}\p{N}]')
def name_to_uri_name(name):
    uri_name=NON_AZ09_REGEXP.sub('', name)
    return uri_name
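
To see what this normalisation does to typical values, the sketch below reimplements it with the standard-library `re` module. It is an approximation, not the notebook's code: `[\W_]` is close to, but not guaranteed identical with, `[^\p{L}\p{N}]` for unusual Unicode characters, and `name_to_uri_name_stdlib` is a hypothetical name.

```python
import re

# Approximate stdlib stand-in: drop every non-word character, plus "_".
NON_ALNUM = re.compile(r"[\W_]")


def name_to_uri_name_stdlib(name):
    """Strip punctuation and whitespace so the name can be used in a URI path."""
    return NON_ALNUM.sub("", name)


print(name_to_uri_name_stdlib("Smellie, William"))  # SmellieWilliam
print(name_to_uri_name_stdlib("EB.1"))              # EB1
```

Because all punctuation is removed, distinct names can collapse to the same URI segment (for example `"A.B"` and `"AB"`), which is worth keeping in mind when reusing these URIs.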

In [37]:
# create edition uri list based on edition number
from rdflib import URIRef
def get_edition_uri_by_number(metadata_df):
    edition_uris = {}
    metadata_df_without_supplement = metadata_df[metadata_df["editionNum"] > 0]
    edition_nums = metadata_df_without_supplement["editionNum"].unique()
    for edition_num in edition_nums:
        edition_df = metadata_df_without_supplement[metadata_df_without_supplement["editionNum"] == edition_num]
        edition_df = edition_df.iloc[0]
        edition_uri = URIRef("https://w3id.org/hto/Edition/" + str(edition_df["MMSID"]))
        edition_uris[edition_num] = edition_uri

    return edition_uris

In [38]:
edition_uris = get_edition_uri_by_number(metadata_df)

In [39]:

def edition2rdf(edition_info, graph, hto):

    # create triples with general datatype
    edition = URIRef("https://w3id.org/hto/Edition/"+str(edition_info["MMSID"]))
    graph.add((edition, RDF.type, hto.Edition))
    graph.add((collection, hto.hadMember, edition))

    # check if it is supplement edition
    supplmentsTo = edition_info["supplementsTo"]
    edition_num = int(edition_info["editionNum"])
    if edition_num == 0 and len(supplmentsTo) > 0 and supplmentsTo[0] != '':
        edition_title= str(edition_info["supplementTitle"])
        subtitle = edition_info["supplementSubTitle"]
        for to_edition_num in supplmentsTo:
            if to_edition_num in edition_uris.keys():
                to_edition_uri = edition_uris[to_edition_num]
                graph.add((edition, hto.wasSupplementOf, to_edition_uri))
    else:
        edition_title= str(edition_info["editionTitle"])
        subtitle = edition_info["editionSubTitle"]
        graph.add((edition, hto.number, Literal(edition_num, datatype=XSD.int)))

    graph.add((edition, hto.title, Literal(edition_title, datatype=XSD.string)))
    if subtitle != 0 and subtitle != "":
        # use the subtitle selected above, so supplements get supplementSubTitle
        graph.add((edition, hto.subtitle, Literal(subtitle, datatype=XSD.string)))

    # publish_year = datetime.strptime(str(edition_info["year"]), "%Y")
    graph.add((edition, hto.yearPublished, Literal(int(edition_info["year"]), datatype=XSD.int)))
    # create a Location instance for printing place
    place_name = str(edition_info["place"])
    place_uri_name = name_to_uri_name(place_name)
    place = URIRef("https://w3id.org/hto/Location/"+place_uri_name)
    graph.add((place, RDF.type, hto.Location))
    graph.add((place, RDFS.label, Literal(place_name, datatype=XSD.string)))
    graph.add((edition, hto.printedAt, place))

    graph.add((edition, hto.mmsid, Literal(str(edition_info["MMSID"]), datatype=XSD.string)))
    graph.add((edition, hto.physicalDescription, Literal(edition_info["physicalDescription"], datatype=XSD.string)))
    graph.add((edition, hto.genre, Literal(edition_info["genre"], datatype=XSD.string)))
    graph.add((edition, hto.language, Literal(edition_info["language"], datatype=XSD.string)))

    # create a Location instance for shelf locator
    shelf_locator_name = str(edition_info["shelfLocator"])
    shelf_locator_uri_name = name_to_uri_name(shelf_locator_name)
    shelf_locator = URIRef("https://w3id.org/hto/Location/"+shelf_locator_uri_name)
    graph.add((shelf_locator, RDF.type, hto.Location))
    graph.add((shelf_locator, RDFS.label, Literal(shelf_locator_name, datatype=XSD.string)))
    graph.add((edition, hto.shelfLocator, shelf_locator))

    ## Editor
    if edition_info["editor"] != 0:
        editor_name = str(edition_info["editor"])
        editor_uri_name = name_to_uri_name(editor_name)
        # only describe and link the editor when a name exists;
        # otherwise `editor` would be undefined in the statements below
        if editor_name != "":
            editor = URIRef("https://w3id.org/hto/Person/" + str(editor_uri_name))
            graph.add((editor, RDF.type, hto.Person))
            graph.add((editor, FOAF.name, Literal(editor_name, datatype=XSD.string)))

            if edition_info["editor_date"] != 0:
                # editor_date is expected as "birth-death", e.g. "1740-1795"
                tmpDate = edition_info["editor_date"].split("-")
                birthYear = int(tmpDate[0])
                deathYear = int(tmpDate[1])
                graph.add((editor, hto.birthYear, Literal(birthYear, datatype=XSD.int)))
                graph.add((editor, hto.deathYear, Literal(deathYear, datatype=XSD.int)))

            if edition_info["termsOfAddress"] != 0:
                graph.add((editor, hto.termsOfAddress, Literal(edition_info["termsOfAddress"], datatype=XSD.string)))

            graph.add((edition, hto.editor, editor))

    #### Publisher persons

    # These names are the result of applying entity recognition to the publisher field

    if edition_info["publisherPersons"] != 0 and len(edition_info["publisherPersons"]) > 0:
        publisherPersons = edition_info["publisherPersons"]
        print(publisherPersons)
        if len(publisherPersons) == 1:
            # a single name is modelled as a Person
            publisher_name = publisherPersons[0]
            publisher_class = hto.Person
            publisher_path = "Person/"
        else:
            # several names are modelled together as one Organization;
            # join() avoids the leading ", " produced by repeated concatenation
            publisher_name = ", ".join(publisherPersons)
            publisher_class = hto.Organization
            publisher_path = "Organization/"

        iri_publisher_name = name_to_uri_name(publisher_name)
        # skip the publisher when no usable URI name remains
        if iri_publisher_name != "":
            publisher = URIRef("https://w3id.org/hto/" + publisher_path + iri_publisher_name)
            graph.add((publisher, RDF.type, publisher_class))
            graph.add((publisher, FOAF.name, Literal(publisher_name, datatype=XSD.string)))
            graph.add((edition, hto.publisher, publisher))

        # Create an instance of publicationActivity
        #publication_activity = URIRef("https://w3id.org/hto/Activity/"+ "publication" + str(edition_info["MMSID"]))
        #graph.add((publication_activity, RDF.type, PROV.Activity))
        #graph.add((publication_activity, PROV.generated, edition))
        #graph.add((publication_activity, PROV.endedAtTime, Literal(publish_year, datatype=XSD.dateTime)))
        #graph.add((publication_activity, PROV.wasEndedBy, publisher))
        #graph.add((edition, PROV.wasGeneratedBy, publication_activity))

    #### Is Referenced by

    if edition_info["referencedBy"] != 0:
        references=edition_info["referencedBy"]
        for r in references:
            book_name = str(r)
            book_uri_name = name_to_uri_name(book_name)
            book = URIRef("https://w3id.org/hto/Book/"+book_uri_name)
            graph.add((book, RDF.type, hto.Book))
            graph.add((book, hto.name, Literal(book_name, datatype=XSD.string)))
            graph.add((edition, hto.referencedBy, book))

    return edition
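
The editor block above assumes `editor_date` always has the form `birth-death`, as in `1740-1795`. If other shapes occur in the metadata, a more defensive parser might look like the sketch below. `parse_life_years` is a hypothetical helper, and the handling of open-ended or malformed values is an assumption, not something the notebook defines.

```python
import re

# Matches "1740-1795" and also open-ended forms such as "1740-" (assumed possible).
LIFE_YEARS = re.compile(r"^\s*(\d{4})\s*-\s*(\d{4})?\s*$")


def parse_life_years(value):
    """Return (birth_year, death_year); either may be None if not parseable."""
    if not isinstance(value, str):
        return None, None  # e.g. the 0 left behind by fillna(0)
    match = LIFE_YEARS.match(value)
    if match is None:
        return None, None
    birth, death = match.groups()
    return int(birth), (int(death) if death else None)


print(parse_life_years("1740-1795"))  # (1740, 1795)
print(parse_life_years(0))            # (None, None)
```

With a helper like this, the birth and death triples could each be added only when the corresponding year is not `None`, instead of failing on an unexpected value.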

In [40]:
def volume2rdf(volume_info, edition, graph, hto):
    volume_id=str(volume_info["volumeId"])
    volume = URIRef("https://w3id.org/hto/Volume/"+str(volume_info["MMSID"])+"_"+str(volume_id))
    graph.add((volume, RDF.type, hto.Volume))
    graph.add((volume, hto.number, Literal(volume_info["volumeNum"], datatype=XSD.integer)))
    if volume_info["letters"] != 0 and volume_info["letters"] != "":
        graph.add((volume, hto.letters, Literal(volume_info["letters"], datatype=XSD.string)))
    graph.add((volume, hto.volumeId, Literal(volume_id, datatype=XSD.string)))
    graph.add((volume, hto.title, Literal(volume_info["volumeTitle"], datatype=XSD.string)))

    if volume_info["part"]!=0:
        graph.add((volume, hto.part, Literal(volume_info["part"], datatype=XSD.integer)))

    permanentURL = URIRef(str(volume_info["permanentURL"]))
    graph.add((permanentURL, RDF.type, hto.Location))
    graph.add((volume, hto.permanentURL, permanentURL))
    # graph.add((volume, hto.numberOfPages, Literal(volume_info["numberOfPages"], datatype=XSD.integer)))
    graph.add((edition, RDF.type, hto.WorkCollection))
    graph.add((edition, hto.hadMember, volume))
    graph.add((volume, hto.wasMemberOf, edition))

    return volume

In [41]:
def get_term_class_name_and_term_ref(term_type, term_id, hto):
    term_ref = URIRef("https://w3id.org/hto/" + term_type + "/" + term_id)
    term_class_name = hto.ArticleTermRecord
    if term_type == "TopicTermRecord":
        term_class_name = hto.TopicTermRecord
    return term_class_name, term_ref

In [42]:
def get_source_ref(filePath, agent):
    if agent == "NCKP":
        parts = filePath.split("/")
        if len(parts) < 3:
            raise ValueError(f"Expected a file path with at least three components, got: {filePath}")
        edition_parts = parts[-3].split("_", 1)
        file_uri = "https://raw.githubusercontent.com/TU-plogan/kp-editions/main/" + edition_parts[0] + "/" +  edition_parts[1] + "/" +  parts[-2] + "/" + parts[-1]
        source_ref = URIRef(file_uri)
    else:
        source_uri_name = filePath.replace("/", "_").replace(".", "_")
        source_ref = URIRef("https://w3id.org/hto/InformationResource/" + source_uri_name)
    return source_ref

In [43]:
# test function get_source_ref
test_get_source_ref_inputs = [
    {"filePath": "./eb07_TXT_v2/e8/kp-eb0708-039107-8218-v2.txt", "agent": "NCKP"},
    {"filePath": "144133901/alto/188082904.34.xml", "agent": "NLS"}
]
# use a name that does not shadow the built-in input()
for test_input in test_get_source_ref_inputs:
    print(get_source_ref(test_input["filePath"], test_input["agent"]))

https://raw.githubusercontent.com/TU-plogan/kp-editions/main/eb07/TXT_v2/e8/kp-eb0708-039107-8218-v2.txt
https://w3id.org/hto/InformationResource/144133901_alto_188082904_34_xml


In [44]:
# create software uris
defoe = URIRef("https://github.com/defoe-code/defoe")
graph.add((defoe, RDF.type, hto.SoftwareAgent))
frances_information_extraction = URIRef("https://github.com/frances-ai/frances-InformationExtraction")
graph.add((frances_information_extraction, RDF.type, hto.SoftwareAgent))
ABBYYFineReader = URIRef("https://pdf.abbyy.com")
graph.add((ABBYYFineReader, RDF.type, hto.SoftwareAgent))

<Graph identifier=N871334772dad4f2a9cfbf10a516d7cb5 (<class 'rdflib.graph.Graph'>)>

In [45]:
def link_entity_with_software(graph, entity, entity_type, agent):
    software = None
    if entity_type == "description":
        if agent == "NLS":
            software = defoe
        else:
            software = frances_information_extraction
    else:
        if agent == "NCKP":
            software = ABBYYFineReader

    if software:
        graph.add((entity, PROV.wasAttributedTo, software))

In [46]:
previous_edition = {}

In [47]:
# Convert one source dataframe into triples and record each term's URI in the dataframe
def dataframe_to_rdf(dataframe, graph, hto, agent_uri, agent, eb_dataset):
    dataframe=dataframe.fillna(0)
    dataframe["id"] = dataframe.index
    # create triples
    edition_mmsids = dataframe["MMSID"].unique()

    for mmsid in edition_mmsids:
        df_edition = dataframe[dataframe["MMSID"] == mmsid].reset_index(drop=True)

        edition_num = int(df_edition.loc[0, "editionNum"])
        year_published = int(df_edition.loc[0, "year"])

        #if edition_num == 1 and year_published == 1771 and agent == "NLS":
            #continue
        # Swap the volumeTitle and editionTitle columns. This belongs in the metadata extraction step; remove the swap here once information extraction is fixed.
        if edition_num > 0:
            df_edition.rename(columns={'editionTitle': 'volumeTitle', 'volumeTitle': 'editionTitle'}, inplace=True)

        edition_info = df_edition.loc[0]
        edition_ref = edition2rdf(edition_info, graph, hto)

        if edition_num != 0:
            # not supplement
            if edition_num in previous_edition.keys():
                # add revision info
                if previous_edition[edition_num]["year"] < year_published:
                    print("revision")
                    graph.add((edition_ref, PROV.wasRevisionOf, previous_edition[edition_num]["uri"]))
                elif previous_edition[edition_num]["year"] > year_published:
                    graph.add((previous_edition[edition_num]["uri"], PROV.wasRevisionOf, edition_ref))
                else:
                    print("equal")
            else:
                previous_edition[edition_num] = {
                    "year": year_published,
                    "uri": edition_ref
                }

        # VOLUMES
        vol_numbers = df_edition["volumeNum"].unique()
        # graph.add((edition_ref, hto.numberOfVolumes, Literal(len(vol_numbers), datatype=XSD.integer)))
        for vol_number in vol_numbers:
            df_vol = df_edition[df_edition["volumeNum"] == vol_number].reset_index(drop=True)
            volume_info = df_vol.loc[0]
            volume_ref = volume2rdf(volume_info, edition_ref, graph, hto)
            # print(volume_info)
            df_vol_by_term = df_vol.groupby(['term'])["term"].count().reset_index(name='counts')
            # print(df_vol_by_term)

            #### TERMS
            for t_index in range(0, len(df_vol_by_term)):
                term=df_vol_by_term.loc[t_index]["term"]
                term_counts=df_vol_by_term.loc[t_index]["counts"]
                term_uri_name = name_to_uri_name(term)
                # print(term_uri_name)
                # All terms in one volume with name equals to value of term
                df_entries= df_vol[df_vol["term"] == term].reset_index(drop=True)
                for t_count in range(0, term_counts):
                    df_entry= df_entries.loc[t_count]
                    term_id = str(mmsid)+"_"+str(df_entry["volumeId"])+"_"+term_uri_name+"_"+str(t_count)
                    term_type = str(df_entry["termType"]) + "TermRecord"

                    term_class_name, term_ref = get_term_class_name_and_term_ref(term_type, term_id, hto)

                    # Add the term_ref to dataframe
                    dataframe_equal = (dataframe['id'] == df_entry['id'])
                    dataframe.loc[dataframe_equal, "uri"] = term_ref

                    graph.add((term_ref, RDF.type, term_class_name))
                    graph.add((term_ref, hto.name, Literal(term, datatype=XSD.string)))
                    if "note" in df_entry:
                        note = df_entry["note"]
                        if note != 0:
                            graph.add((term_ref, hto.note, Literal(note, datatype=XSD.string)))

                    if "alter_names" in df_entry:
                        alter_names = df_entry["alter_names"]
                        for alter_name in alter_names:
                            graph.add((term_ref, hto.name, Literal(alter_name, datatype=XSD.string)))

                    # Create original description instance
                    description = df_entry["definition"]
                    if description != "":

                        term_original_description = URIRef("https://w3id.org/hto/OriginalDescription/" + str(df_entry["MMSID"])+"_"+str(df_entry["volumeId"])+"_"+term_uri_name+"_"+str(t_count)+agent)
                        graph.add((term_original_description, RDF.type, hto.OriginalDescription))
                        text_quality = hto.Low
                        if agent == "Ash":
                            text_quality = hto.Moderate
                        elif agent == "NCKP":
                            text_quality = hto.High
                        graph.add((term_original_description, hto.hasTextQuality, text_quality))
                        # graph.add((term_original_description, hto.numberOfWords, Literal(df_entry["numberOfWords"], datatype=XSD.int)))
                        graph.add((term_original_description, hto.text, Literal(df_entry["definition"], datatype=XSD.string)))

                        graph.add((term_ref, hto.hasOriginalDescription, term_original_description))
                        graph.add((term_ref, hto.position, Literal(df_entry["position"], datatype=XSD.int)))

                        link_entity_with_software(graph, term_original_description, "description", agent)

                        # Create source entity where original description was extracted
                        # source location
                        # source_path_name = df_entry["altoXML"]
                        # source_path_ref = URIRef("https://w3id.org/eb/Location/" + source_path_name)
                        # graph.add((source_path_ref, RDF.type, PROV.Location))
                        # source
                        file_path = str(df_entry["filePath"])
                        source_ref = get_source_ref(file_path, agent)
                        graph.add((source_ref, RDF.type, hto.InformationResource))
                        graph.add((source_ref, PROV.value, Literal(file_path, datatype=XSD.string)))
                        graph.add((eb_dataset, hto.hadMember, source_ref))
                        graph.add((source_ref, PROV.wasAttributedTo, agent_uri))
                        link_entity_with_software(graph, source_ref, "source", agent)

                        #graph.add((source_ref, PROV.atLocation, source_path_ref))
                        # related agent and activity


                        """
                        source_digitalising_activity = URIRef("https://w3id.org/eb/Activity/nls_digitalising_activity" + source_name)
                        graph.add((source_digitalising_activity, RDF.type, PROV.Activity))
                        graph.add((source_digitalising_activity, PROV.generated, source_ref))
                        graph.add((source_digitalising_activity, PROV.wasAssociatedWith, nls))
                        graph.add((source_ref, PROV.wasGeneratedBy, source_digitalising_activity))
                        """
                        graph.add((term_original_description, hto.wasExtractedFrom, source_ref))

                    ## startsAt
                    page_startsAt= URIRef("https://w3id.org/hto/Page/"+ str(df_entry["MMSID"])+"_"+str(df_entry["volumeId"])+"_"+str(df_entry["startsAt"]))
                    graph.add((page_startsAt, RDF.type, hto.Page))
                    graph.add((page_startsAt, hto.number, Literal(df_entry["startsAt"], datatype=XSD.int)))
                    if df_entry["header"] != 0 and df_entry["header"] != "":
                        graph.add((page_startsAt, hto.header, Literal(df_entry["header"], datatype=XSD.string)))
                    # graph.add((page_startsAt, hto.numberOfTerms, Literal(df_entry["numberOfTerms"], datatype=XSD.int)))
                    graph.add((volume_ref, RDF.type, hto.WorkCollection))
                    graph.add((volume_ref, hto.hadMember, page_startsAt))
                    graph.add((term_ref, hto.startsAtPage, page_startsAt))
                    graph.add((page_startsAt, RDF.type, hto.WorkCollection))
                    graph.add((page_startsAt, hto.hadMember, term_ref))

                    ## endsAt
                    page_endsAt= URIRef("https://w3id.org/hto/Page/"+ str(df_entry["MMSID"])+"_"+str(df_entry["volumeId"])+"_"+str(df_entry["endsAt"]))
                    graph.add((page_endsAt, RDF.type, hto.Page))
                    graph.add((page_endsAt, hto.number, Literal(df_entry["endsAt"], datatype=XSD.int)))
                    # graph.add((page_endsAt, hto.numberOfTerms, Literal(df_entry["numberOfTerms"], datatype=XSD.int)))
                    graph.add((volume_ref, hto.hadMember, page_endsAt))
                    graph.add((term_ref, hto.endsAtPage, page_endsAt))
                    graph.add((page_endsAt, RDF.type, hto.WorkCollection))
                    graph.add((page_endsAt, hto.hadMember, term_ref))

    return graph, dataframe
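
Terms such as `A` or `EARTH` occur several times in one volume, so each occurrence needs its own URI. The sketch below isolates the numbering idea used in the loop above with plain Python: a per-volume, per-term counter appended to the identifier. `build_term_ids` is a hypothetical helper, the regex is an approximate stand-in for the notebook's, and the sample rows reuse identifiers shown earlier purely for illustration.

```python
import re
from collections import Counter

NON_ALNUM = re.compile(r"[\W_]")  # approximate stand-in for the notebook's regex


def build_term_ids(mmsid, rows):
    """Yield one identifier per row: <mmsid>_<volumeId>_<term>_<n>.

    `n` counts earlier rows with the same term in the same volume,
    so repeated terms get 0, 1, 2, ... in row order.
    """
    seen = Counter()
    for row in rows:
        key = (row["volumeId"], row["term"])
        term_uri_name = NON_ALNUM.sub("", row["term"])
        yield f"{mmsid}_{row['volumeId']}_{term_uri_name}_{seen[key]}"
        seen[key] += 1


# Illustrative rows: two entries for "A", one for "AA".
sample_rows = [
    {"volumeId": 192984259, "term": "A"},
    {"volumeId": 192984259, "term": "AA"},
    {"volumeId": 192984259, "term": "A"},
]
for term_id in build_term_ids(9910796273804340, sample_rows):
    print(term_id)
```

The counter depends on row order, so identifiers stay stable only as long as the rows of a volume are processed in the same order on every run.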

In [48]:
from datetime import datetime
from rdflib import Literal, XSD, RDF
from rdflib.namespace import FOAF, PROV, SDO
# agents that produced the source text: [display name, ontology class]
agents = {
    "NCKP": ["Nineteenth-Century Knowledge Project", hto.Organization],
    "Ash": ["Ash Charlton", hto.Person],
    "NLS": ["National Library of Scotland", hto.Organization]
}

def create_organization(graph, agent):
    agent_uri = URIRef("https://w3id.org/hto/Organization/" + agent)
    graph.add((agent_uri, RDF.type, agents[agent][1]))
    graph.add((agent_uri, FOAF.name, Literal(agents[agent][0], datatype=XSD.string)))
    return agent_uri

In [49]:
def create_eb_text_dataset(graph, agent_uri, agent):
    eb_text_dataset = URIRef("https://w3id.org/hto/Collection/" + agent + "_eb_dataset")
    graph.add((eb_text_dataset, RDF.type, PROV.Collection))
    graph.add((eb_text_dataset, PROV.wasAttributedTo, agent_uri))

    # Create digitalising activity
    digitalising_activity = URIRef("https://w3id.org/hto/Activity/" + agent + "_digitalising_activity")
    graph.add((digitalising_activity, RDF.type, hto.Activity))
    graph.add((digitalising_activity, PROV.generated, eb_text_dataset))
    graph.add((digitalising_activity, PROV.wasAssociatedWith, agent_uri))
    graph.add((eb_text_dataset, PROV.wasGeneratedBy, digitalising_activity))
    return eb_text_dataset

In [50]:
import pandas as pd

datetime_with_uris_list = []

#print(1)
# Ash Edition 1
#agent = "Ash"
#agent_uri = create_organization(graph, agent)
#eb_text_dataset = create_eb_text_dataset(graph, agent_uri, agent)
# import data from 1st edition
#df_1= pd.read_json('../source_dataframes/eb/ash_final_eb_1_dataframe_clean_Damon', orient="index")
#graph, dataframe_with_uris = dataframe_to_rdf(df_1, graph, hto,  agent_uri, agent, eb_text_dataset)

print(1)

# NLS Edition 1
agent = "NLS"
agent_uri = create_organization(graph, agent)
eb_text_dataset = create_eb_text_dataset(graph, agent_uri, agent)
# import data from 1st edition
df_1= pd.read_json('../source_dataframes/eb/final_eb_1_dataframe', orient="index")
df_1.rename(columns={"typeTerm": "termType", "positionPage": "position", "altoXML": "filePath"}, inplace=True)
graph, dataframe_with_uris = dataframe_to_rdf(df_1, graph, hto,  agent_uri, agent, eb_text_dataset)
datetime_with_uris_list.append(dataframe_with_uris)

print(2)

# NLS Edition 2

# import data from 2nd edition
df_2= pd.read_json('../source_dataframes/eb/final_eb_2_dataframe', orient="index")
df_2.rename(columns={"typeTerm": "termType", "positionPage": "position", "altoXML": "filePath"}, inplace=True)
graph, dataframe_with_uris = dataframe_to_rdf(df_2, graph, hto,  agent_uri, agent, eb_text_dataset)
datetime_with_uris_list.append(dataframe_with_uris)

print(3)

# NLS Edition 3
# import data from 3rd edition
df_3= pd.read_json('../source_dataframes/eb/final_eb_3_dataframe', orient="index")
df_3.rename(columns={"typeTerm": "termType", "positionPage": "position", "altoXML": "filePath"}, inplace=True)
graph, dataframe_with_uris = dataframe_to_rdf(df_3, graph, hto,  agent_uri, agent, eb_text_dataset)
datetime_with_uris_list.append(dataframe_with_uris)

print(4)
# NLS Edition 4
# import data from 4th edition
df_4= pd.read_json('../source_dataframes/eb/final_eb_4_dataframe', orient="index")
df_4.rename(columns={"typeTerm": "termType", "positionPage": "position", "altoXML": "filePath"}, inplace=True)
graph, dataframe_with_uris = dataframe_to_rdf(df_4, graph, hto,  agent_uri, agent, eb_text_dataset)
datetime_with_uris_list.append(dataframe_with_uris)

print(4, 5, 6)
# NLS supplement to editions 4, 5 and 6
df_456= pd.read_json('../source_dataframes/eb/final_eb_4_5_6_suplement_dataframe', orient="index")
df_456.rename(columns={"typeTerm": "termType", "positionPage": "position", "altoXML": "filePath"}, inplace=True)
graph, dataframe_with_uris = dataframe_to_rdf(df_456, graph, hto,  agent_uri, agent, eb_text_dataset)
datetime_with_uris_list.append(dataframe_with_uris)

print(5)
# NLS Edition 5
# import data from 5th edition
df_5= pd.read_json('../source_dataframes/eb/final_eb_5_dataframe', orient="index")
df_5.rename(columns={ "typeTerm": "termType", "positionPage": "position", "altoXML": "filePath"}, inplace=True)
graph, dataframe_with_uris = dataframe_to_rdf(df_5, graph, hto,  agent_uri, agent, eb_text_dataset)
datetime_with_uris_list.append(dataframe_with_uris)

print(6)
# NLS Edition 6
# import data from 6th edition
df_6= pd.read_json('../source_dataframes/eb/final_eb_6_dataframe', orient="index")
df_6.rename(columns={"typeTerm": "termType", "positionPage": "position", "altoXML": "filePath"}, inplace=True)
graph, dataframe_with_uris = dataframe_to_rdf(df_6, graph, hto,  agent_uri, agent, eb_text_dataset)
datetime_with_uris_list.append(dataframe_with_uris)

print(7)
# NLS Edition 7
#agent = "NLS"
#agent_uri = create_organization(graph, agent)
#eb_text_dataset = create_eb_text_dataset(graph, agent_uri, agent)
# import data from 7th edition
df_7= pd.read_json('../source_dataframes/eb/final_eb_7_dataframe', orient="index")
df_7.rename(columns={"typeTerm": "termType", "positionPage": "position", "altoXML": "filePath"}, inplace=True)
graph, dataframe_with_uris = dataframe_to_rdf(df_7, graph, hto,  agent_uri, agent, eb_text_dataset)
datetime_with_uris_list.append(dataframe_with_uris)

print(8)
# NLS Edition 8
# import data from 8th edition
df_8= pd.read_json('../source_dataframes/eb/final_eb_8_dataframe', orient="index")
df_8.rename(columns={"typeTerm": "termType", "positionPage": "position", "altoXML": "filePath"}, inplace=True)
graph, dataframe_with_uris = dataframe_to_rdf(df_8, graph, hto,  agent_uri, agent, eb_text_dataset)
datetime_with_uris_list.append(dataframe_with_uris)

#print(7)
# NCKP Edition 7
#agent = "NCKP"
#agent_uri = create_organization(graph, agent)
#eb_text_dataset = create_eb_text_dataset(graph, agent_uri, agent)
# import data from 7th edition
#df_7 = pd.read_json('../source_dataframes/eb/nckp_final_eb_7_dataframe_clean_Damon', orient="index")
#graph, dataframe_with_uris = dataframe_to_rdf(df_7, graph, hto,  agent_uri, agent, eb_text_dataset)

dataframe_with_uris_total = pd.concat(datetime_with_uris_list, ignore_index=True)

1
['C. Macfarquhar', 'Colin Macfarquhar']
['John Donaldson']
revision
2
['W. Gordon', 'J. Bell', 'J. Dickson', 'C. Elliot']
3
['C. Macfarquhar']
['J. Brown']
4
4 5 6
5
6
7
8


In [55]:
dataframe_with_uris_total.iloc[0]

MMSID                                                    992277653804341
editionTitle                          First edition, 1771, Volume 1, A-B
editor                                                  Smellie, William
editor_date                                                    1740-1795
genre                                                       encyclopedia
language                                                             eng
termsOfAddress                                                       0.0
numberOfPages                                                        832
physicalDescription               3 v., 160 plates : ill. ; 26 cm. (4to)
place                                                          Edinburgh
publisher              Printed for A. Bell and C. Macfarquhar; and so...
referencedBy           [Alston, R.C.  Engl. language III, 560, ESTC T...
shelfLocator                                                        EB.1
editionSubTitle        Illustrated with one hundred

In [51]:
# store the new dataframe with uris
dataframe_with_uris_total.to_json('./final_eb_total_dataframe_lq_with_uris', orient="index")

In [54]:
# Save the Graph in the RDF Turtle format
graph.serialize(format="turtle", destination="../../results/hto_eb_total_lq.ttl")

<Graph identifier=N871334772dad4f2a9cfbf10a516d7cb5 (<class 'rdflib.graph.Graph'>)>