<h1> Creating the Knowledge Graph </h1>

This notebook takes the [DataFrame](https://github.com/alexyoung13/frances_dissertation_ay55/blob/main/Notebooks/Create_DataFrame_7ed_EB_final.ipynb) created in the previous notebook and converts it to a [Knowledge Graph](https://github.com/alexyoung13/frances_dissertation_ay55/blob/main/Notebooks/DataFrame2RDF_7thEdition.ipynb). The graph is then saved as a Turtle (`.ttl`) file for later use in [creating term documents](https://github.com/alexyoung13/frances_dissertation_ay55/blob/main/Notebooks/Create_documents.ipynb). The file can be uploaded to a Fuseki server or loaded manually, as shown in the [creating term documents notebook](https://github.com/alexyoung13/frances_dissertation_ay55/blob/main/Notebooks/Create_documents.ipynb).

In [23]:
from datetime import datetime
import pandas as pd
from rdflib import Graph, URIRef, Literal, Namespace, XSD
from rdflib.namespace import RDF, RDFS

In [24]:
#Function to turn an edition of EB into RDF. It is called in the main loop later.
#Some steps may seem redundant because only a single version of one edition is processed here,
#but the function is designed this way to match the existing EB Ontology and must be exact.
def edition2rdf(data, g, eb):
    edition = URIRef("https://w3id.org/eb/i/Edition/"+str(data["MMSID"]))
    edition_title= "Edition "+ str(data["editionNum"])+"," +str(data["year"])
    g.add((edition, RDF.type, eb.Edition))
    g.add((edition, eb.number, Literal(data["editionNum"], datatype=XSD.integer)))
    g.add((edition, eb.title, Literal(edition_title, datatype=XSD.string)))
    g.add((edition, eb.subtitle, Literal(data["editionSubTitle"], datatype=XSD.string)))
    g.add((edition, eb.publicationYear, Literal(data["year"], datatype=XSD.integer)))
    g.add((edition, eb.printedAt, Literal(data["place"], datatype=XSD.string)))
    g.add((edition, eb.mmsid, Literal(str(data["MMSID"]), datatype=XSD.string)))
    g.add((edition, eb.physicalDescription, Literal(data["physicalDescription"], datatype=XSD.string)))
    g.add((edition, eb.genre, Literal(data["genre"], datatype=XSD.string)))
    g.add((edition, eb.language, Literal(data["language"], datatype=XSD.string)))
    g.add((edition, eb.shelfLocator, Literal(data["shelfLocator"], datatype=XSD.string)))
    g.add((edition, eb.numberOfVolumes, Literal(data["numberOfVolumes"], datatype=XSD.integer)))

    #Creates an Editor in the KG
    name=data["editor"].replace(" ", "")
    editor = URIRef("https://w3id.org/eb/i/Person/"+str(name))
    g.add((editor, RDF.type, eb.Person))
    g.add((editor, eb.name, Literal(data["editor"], datatype=XSD.string)))

    #if editor date exists
    if data["editor_date"]!=0:
        tmpDate=data["editor_date"].split("-")
        birthDate=datetime.strptime(tmpDate[0], '%Y')
        deathDate=datetime.strptime(tmpDate[1], '%Y')
        g.add((editor, eb.birthDate, Literal(birthDate, datatype=XSD.dateTime)))
        g.add((editor, eb.deathDate, Literal(deathDate, datatype=XSD.dateTime)))
    
    #if terms of Address exists
    if data["termsOfAddress"] != 0:
        g.add((editor, eb.termsOfAddress, Literal(data["termsOfAddress"], datatype=XSD.string)))

    g.add((edition, eb.editor, editor))

    #Creates a Person entry for each publisher
    if data["publisherPersons"] != 0:
        publisherPersons=data["publisherPersons"]
        for p in publisherPersons: 
            name=p.replace(" ", "")
            publisher = URIRef("https://w3id.org/eb/i/Person/"+name)
            g.add((publisher, RDF.type, eb.Person))
            g.add((publisher, eb.name, Literal(p, datatype=XSD.string)))
            g.add((edition, eb.publisher, publisher))
        
    #If the edition is referenced by another book
    if data["referencedBy"] != 0:
        references=data["referencedBy"]
        for r in references: 
            name=r.replace(" ", "")
            book = URIRef("https://w3id.org/eb/i/Book/"+name)
            g.add((book, RDF.type, eb.Book))
            g.add((book, eb.title, Literal(r, datatype=XSD.string)))
            g.add((edition, eb.referencedBy, book))
            
    return g, edition
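
The editor's life dates arrive as a single string such as `1753-1828`, which `edition2rdf` splits on the hyphen before parsing each year. The sketch below isolates that step so it can be tried on its own. The helper name `parse_life_dates` is hypothetical, and returning `None` for missing or malformed values is an assumption rather than something the notebook does.

```python
from datetime import datetime

def parse_life_dates(raw):
    """Split a 'YYYY-YYYY' string into (birth, death) datetimes.

    Hypothetical helper: returns None when the value is not a string
    (for example the 0 placeholder left by fillna) or when it does not
    contain exactly two parseable years.
    """
    if not isinstance(raw, str):
        return None
    parts = raw.split("-")
    if len(parts) != 2:
        return None
    try:
        birth = datetime.strptime(parts[0].strip(), "%Y")
        death = datetime.strptime(parts[1].strip(), "%Y")
    except ValueError:
        return None
    return birth, death

dates = parse_life_dates("1753-1828")
print(dates[0].year, dates[1].year)
print(parse_life_dates(0))
```

Checking the type first means the `fillna(0)` placeholder and a genuinely malformed string are handled by the same code path, instead of relying on a `!= 0` comparison alone.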

In [25]:
df= pd.read_json('../data/final_eb_7_dataframe_clean', orient="index") 
df.loc[0]

term                                                                   A
definition             The first letter of the alphabet in every know...
MMSID                                                   9910796273804340
edTitle                                   Seventh edition, General index
editor                                                   Stewart, Dugald
editor_date                                                    1753-1828
genre                                                       encyclopedia
language                                                             eng
termsOfAddress                                                       Sir
numberOfPages                                                        184
physicalDescription                                   21 v. in 22 ; 4to.
place                                                          Edinburgh
publisher                                                  A. & C. Black
referencedBy                                       

In [26]:
df=df.fillna(0)
print(df.shape)


(23121, 39)


In [27]:
# Create a Graph
g = Graph()

unique_terms = 0
total_terms = 0
article_terms = 0
topic_terms = 0

#override must be the boolean False; the string "False" is truthy and would act as True
eb = Namespace("https://w3id.org/eb#")
g.namespace_manager.bind('eb', eb, override=False)

#code is designed for multiple editions to be processed in the future
list_years=df["year"].unique()
ed_revisions=[]

for y in range(0, len(list_years)):
    
    #Prints the edition as it is converted
    print("YEAR %s" %list_years[y])
    
    #creates an entry for each edition
    df_year=df[df['year'] == list_years[y]].reset_index(drop=True)
    edition_data = df_year.loc[0]
    g, edition = edition2rdf(edition_data,g, eb)
    ed_revisions.append(edition)
    
    #Creates a volume entry for each in the edition
    list_vols = df_year["volumeNum"].unique()
    for v in range(0,len(list_vols)):
        print("Vol %s" % list_vols[v])
        df_year_vl=df_year[df_year["volumeNum"] == list_vols[v]].reset_index(drop=True)
        
        volume_data=df_year_vl.loc[0]
        volume_id=volume_data["volumeId"]
        volume = URIRef("https://w3id.org/eb/i/Volume/"+str(volume_data["MMSID"])+"_"+str(volume_data["volumeId"]))
        print("Volume URI is %s" %volume)
        g.add((volume, RDF.type, eb.Volume))
        g.add((volume, eb.number, Literal(volume_data["volumeNum"], datatype=XSD.integer)))
        g.add((volume, eb.letters, Literal(volume_data["letters"], datatype=XSD.string)))
        g.add((volume, eb.volumeId, Literal(volume_data["volumeId"], datatype=XSD.int)))
        g.add((volume, eb.title, Literal(volume_data["volumeTitle"], datatype=XSD.string)))
        
        if volume_data["part"]!=0:
            g.add((volume, eb.part, Literal(volume_data["part"], datatype=XSD.string)))
    
        g.add((volume, eb.metsXML, Literal(volume_data["metsXML"], datatype=XSD.string)))
        g.add((volume, eb.permanentURL, Literal(volume_data["permanentURL"], datatype=XSD.string)))
        g.add((volume, eb.numberOfPages, Literal(volume_data["numberOfPages"], datatype=XSD.string)))
    
        g.add((edition, eb.hasPart, volume))
    
        df_by_term=df_year_vl.groupby(['term'],)["term"].count().reset_index(name='counts')
                                
        #Creates a term for each in each volume
        print("Unique Terms per vol: " , len(df_by_term))
        print("Total terms per vol: ", df_by_term["counts"].sum())
        unique_terms += len(df_by_term)
        total_terms += df_by_term["counts"].sum()
        topic_terms_volume = 0
        article_terms_volume = 0
        for t_index in range(0, len(df_by_term)):
            t=df_by_term.loc[t_index]["term"]
            c=df_by_term.loc[t_index]["counts"]
            df_entries= df_year_vl[df_year_vl["term"] == t].reset_index(drop=True)
            for t_count in range(0, c):
                df_entry= df_entries.loc[t_count]
                if df_entry["typeTerm"] == "Article" :
                    term= URIRef("https://w3id.org/eb/i/Article/"+str(df_entry["MMSID"])+"_"+str(df_entry["volumeId"])+"_"+t.replace(" ", "_")+"_"+str(t_count))
                    g.add((term, RDF.type, eb.Article))
                    article_terms += 1
                    article_terms_volume += 1
                elif df_entry["typeTerm"] == "Topic" :
                    term= URIRef("https://w3id.org/eb/i/Topic/"+str(df_entry["MMSID"])+"_"+str(df_entry["volumeId"])+"_"+t.replace(" ", "_")+"_"+str(t_count))
                    g.add((term, RDF.type, eb.Topic))
                    topic_terms +=1
                    topic_terms_volume += 1
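                else:
                    #skip entries that are neither Article nor Topic; otherwise their data
                    #would be attached to the term left over from the previous iteration
                    continue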
                g.add((term, eb.name, Literal(t, datatype=XSD.string)))
                g.add((term, eb.definition, Literal(df_entry["definition"], datatype=XSD.string)))
                g.add((term, eb.position, Literal(df_entry["positionPage"], datatype=XSD.int)))
                g.add((term, eb.numberOfWords, Literal(df_entry["numberOfWords"], datatype=XSD.int)))
                g.add((volume, eb.hasPart, term))
            
                #This creates a page entry
                page_startsAt= URIRef("https://w3id.org/eb/i/Page/"+ str(df_entry["MMSID"])+"_"+str(df_entry["volumeId"])+"_"+str(df_entry["startsAt"]))
                g.add((page_startsAt, RDF.type, eb.Page))
                g.add((page_startsAt, eb.number, Literal(df_entry["startsAt"], datatype=XSD.int)))
                g.add((page_startsAt, eb.header, Literal(df_entry["header"], datatype=XSD.string)))
                g.add((page_startsAt, eb.numberOfTerms, Literal(df_entry["numberOfTerms"], datatype=XSD.string)))
                g.add((volume, eb.hasPart, page_startsAt))
                g.add((term, eb.startsAtPage, page_startsAt))
                g.add((page_startsAt, eb.hasPart, term))
                g.add((page_startsAt, eb.altoXML, Literal(df_entry["altoXML"], datatype=XSD.string)))
            
                #This is not as essential for the txt data but is included to match the XML data
                page_endsAt= URIRef("https://w3id.org/eb/i/Page/"+ str(df_entry["MMSID"])+"_"+str(df_entry["volumeId"])+"_"+str(df_entry["endsAt"]))
                g.add((page_endsAt, RDF.type, eb.Page))
                g.add((page_endsAt, eb.number, Literal(df_entry["endsAt"], datatype=XSD.int)))
                g.add((volume, eb.hasPart, page_endsAt))
                g.add((term, eb.endsAtPage, page_endsAt))
                g.add((page_endsAt, eb.hasPart, term))                
                
                #Related terms are not used for this edition but are included for future use.
                #Entries marked "Not specified" are skipped. That marker is set in Create_DataFrame_7ed_EB_final.ipynb
                if df_entry["relatedTerms"] != "Not specified" and df_entry["relatedTerms"]:
                    for rt in df_entry["relatedTerms"]:
                        if rt!= t:
                            related_df_entries= df_year[df_year["term"] == rt].reset_index(drop=True)
                            list_r_vl=related_df_entries["volumeNum"].unique()
                            for r_vl in list_r_vl:
                                df_r_vl=related_df_entries[related_df_entries["volumeNum"] == r_vl].reset_index(drop=True)
                                for r_c in range (0, len(df_r_vl)):
                                    r_entry= df_r_vl.loc[r_c]
                                    #build the URI the same way as the term URIs above (spaces become underscores)
                                    rt_slug = rt.replace(" ", "_")
                                    if r_entry["typeTerm"] == "Article" :
                                        r_term= URIRef("https://w3id.org/eb/i/Article/"+str(r_entry["MMSID"])+"_"+str(r_entry["volumeId"])+"_"+rt_slug+"_"+str(r_c))
                                    elif r_entry["typeTerm"] == "Topic" :
                                        r_term= URIRef("https://w3id.org/eb/i/Topic/"+str(r_entry["MMSID"])+"_"+str(r_entry["volumeId"])+"_"+rt_slug+"_"+str(r_c))
                                    else:
                                        #no URI exists for other entry types
                                        continue
                                        
                                    g.add((term, eb.relatedTerms, r_term))
                                    
                                    
        print("article_terms_volume: " , article_terms_volume)
        print("topic_terms_volume: " , topic_terms_volume)                        


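#link the second edition to the first as a revision, if more than one edition was processed
if len(ed_revisions) > 1:
    g.add((ed_revisions[1], eb.revisionOf, ed_revisions[0]))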

print("Unique Terms: " , unique_terms)
print("Total Terms: " , total_terms)

YEAR 1842
Vol 2
Volume URI is https://w3id.org/eb/i/Volume/9910796273804340_192984259
Unique Terms per vol:  1636
Total terms per vol:  1898
article_terms_volume:  1796
topic_terms_volume:  102
Vol 3
Volume URI is https://w3id.org/eb/i/Volume/9910796273804340_193057500
Unique Terms per vol:  842
Total terms per vol:  948
article_terms_volume:  870
topic_terms_volume:  78
Vol 4
Volume URI is https://w3id.org/eb/i/Volume/9910796273804340_193108322
Unique Terms per vol:  1450
Total terms per vol:  1641
article_terms_volume:  1491
topic_terms_volume:  150
Vol 5
Volume URI is https://w3id.org/eb/i/Volume/9910796273804340_193696083
Unique Terms per vol:  650
Total terms per vol:  754
article_terms_volume:  677
topic_terms_volume:  77
Vol 6
Volume URI is https://w3id.org/eb/i/Volume/9910796273804340_193322690
Unique Terms per vol:  1326
Total terms per vol:  1496
article_terms_volume:  1392
topic_terms_volume:  104
Vol 7
Volume URI is https://w3id.org/eb/i/Volume/9910796273804340_193819043
Un
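
The Article and Topic URIs built in the loop above all follow one pattern: the base `https://w3id.org/eb/i/`, the entry type, then the MMSID, volume id, term (with spaces replaced by underscores) and a per-term counter joined by underscores. A small sketch of that pattern as a reusable function follows. The name `term_uri` is hypothetical, and the percent-encoding through `urllib.parse.quote` is an added assumption that the notebook itself does not apply.

```python
from urllib.parse import quote

BASE = "https://w3id.org/eb/i/"

def term_uri(kind, mmsid, volume_id, term, count):
    """Build a term URI string in the pattern used by the main loop.

    kind is "Article" or "Topic". Spaces become underscores, as in the
    notebook; percent-encoding other characters is an assumption here.
    """
    slug = quote(term.replace(" ", "_"), safe="_")
    return f"{BASE}{kind}/{mmsid}_{volume_id}_{slug}_{count}"

print(term_uri("Article", 9910796273804340, 192984259, "A", 0))
```

Building every term URI through one function would keep the related-term links and the term entries consistent, since both would derive the slug in exactly the same way.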

In [29]:
g.serialize(format="turtle", destination="../data/edition7_clean.ttl")

<Graph identifier=N305283ddb53443d1ba263925dce9841f (<class 'rdflib.graph.Graph'>)>