# Chapbook Dataframe to RDF

This notebook creates RDF triples from the Chapbooks dataframe according to the HTO ontology and exports them as a Turtle (`.ttl`) file. The data in the dataframe was extracted from a single source.

## Load and check the NLS dataframe

Each row of the dataframe describes one page and has the following columns; the values shown come from a sample row:

- MMSID                                                   9937033633804341
- serieTitle                            Chapbooks printed in Scotland
- editor                                                       Milne, John
- editor_date                                                    1792-1871
- genre                              Chapbooks-Scotland-Aberdeen-1801-1900
- language                                                             eng
- metsXML                                               104184105-mets.xml
- termsOfAddress                                                      None
- numberOfPages                                                          8
- numberOfWords                                                         53
- permanentURL                            https://digital.nls.uk/104184105
- physicalDescription                                        8 p. ; 18 cm.
- place                                                           Aberdeen
- publisher                             Printed by A. Imlay, 22, Long Acre
- referencedBy                                                        None
- shelfLocator                                               L.C.2786.A(1)
- altoXML                                  104184105/alto/107134030.34.xml
- serieSubTitle                             to the tune of Johnny Cop
- text                   A SONG JRAISB OP THE ^ HIGHLAND LADS. To the T...
- pageNum                                                            Page1
- volumeTitle                          song in praise of the highland lads
- volumeId                                                       104184105
- year                                                                1826
- collectionNum                                                          0
- part                                                                   0
- publisherPersons                                                      []
- numberOfVolumes                                                     3080
- volumeNum                                                             1

In [38]:
import re

import pandas as pd

# NLS Chapbooks dataframe
chapbook_df = pd.read_json('../source_dataframes/chapbooks/chapbooks_dataframe', orient="index")
# Missing values are replaced by 0, which the functions below treat as "no value"
chapbook_df = chapbook_df.fillna(0)

len(chapbook_df)

47329

In [39]:
series_mmsids = chapbook_df["MMSID"].unique()
print(series_mmsids)

[9937033633804340 9937038123804340 9937038533804340 ... 9937392033804340
 9937396893804340 9937393453804340]
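
The identifiers printed above end in `...340`, while the sample row in the column listing shows `9937033633804341`. One plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the value passed through a 64-bit float at some point during loading: above 2^53, float64 can only represent even integers at this magnitude. The sketch below reproduces the effect with the standard library only.

```python
# The MMSID shown in the column listing.
mmsid = 9937033633804341

# float64 has a 53-bit significand, so integers between 2**53 and 2**54
# are spaced 2 apart and odd values cannot be stored exactly.
as_float = float(mmsid)
round_tripped = int(as_float)

print(mmsid > 2**53)     # True
print(round_tripped)     # 9937033633804340
```

If that is indeed the cause, reading the identifier column as strings (for example through the `dtype` argument of `pd.read_json`) may keep the digits intact; whether this applies to this particular file has not been checked here.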


In [40]:
df_series = chapbook_df[chapbook_df["numberOfVolumes"] > 1].reset_index(drop=True)
print(df_series)

                  MMSID edition         editor editor_date  \
0      9937039513804340       0              0           0   
1      9937039513804340       0              0           0   
2      9937039513804340       0              0           0   
3      9937039513804340       0              0           0   
4      9937039513804340       0              0           0   
...                 ...     ...            ...         ...   
10129  9937609363804340       0  Burns, Robert   1759-1796   
10130  9937609363804340       0  Burns, Robert   1759-1796   
10131  9937609363804340       0  Burns, Robert   1759-1796   
10132  9937609363804340       0  Burns, Robert   1759-1796   
10133  9937609363804340       0  Burns, Robert   1759-1796   

                                           genre language             metsXML  \
0      Chapbooks-Scotland-Dunfermline-1801-1900.      eng  104184112-mets.xml   
1      Chapbooks-Scotland-Dunfermline-1801-1900.      eng  104184112-mets.xml   
2      Chapb

## Load the HTO ontology and import the NLS dataframe into it

In [41]:
from rdflib import Graph, URIRef, Namespace

# Create a new RDFLib Graph
graph = Graph()

# Load your ontology file into the graph
ontology_file = "hto.ttl"
graph.parse(ontology_file, format="turtle")
hto = Namespace("https://w3id.org/hto#")


In [42]:
# Print the number of triples in the graph after loading the ontology
print(f"Graph g has {len(graph)} statements.")

Graph g has 536 statements.


In [43]:
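import regex

# Matches every character that is not a Unicode letter or number.
# A raw string is used so that \p is not treated as a string escape.
NON_AZ09_REGEXP = regex.compile(r'[^\p{L}\p{N}]')

def name_to_uri_name(name):
    """Strip everything except letters and digits so the name can be used in a URI."""
    uri_name = NON_AZ09_REGEXP.sub('', name)
    return uri_name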
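
To see what `name_to_uri_name` produces, the sketch below applies an equivalent pattern to a few values from the sample row. It uses the standard-library `re` module with `[\W_]` as a stand-in for `[^\p{L}\p{N}]`; that substitution and the name `to_uri_name` are assumptions made to keep the example dependency-free, and the two patterns may differ on unusual Unicode characters.

```python
import re

# Stand-in for the regex-module pattern [^\p{L}\p{N}]: remove every
# character that is not a letter or digit (\W plus the underscore).
NON_ALNUM = re.compile(r"[\W_]")

def to_uri_name(name: str) -> str:
    """Hypothetical stdlib-only equivalent of name_to_uri_name."""
    return NON_ALNUM.sub("", name)

print(to_uri_name("Aberdeen"))        # Aberdeen
print(to_uri_name("L.C.2786.A(1)"))   # LC2786A1
print(to_uri_name("Milne, John"))     # MilneJohn
```

Because punctuation and spaces are dropped, distinct labels such as `L.C.2786.A(1)` and `LC.2786.A1` would map to the same URI fragment, which is worth keeping in mind wherever these fragments act as identifiers.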

In [44]:
from datetime import datetime
from rdflib import Literal, XSD, RDF
from rdflib.namespace import FOAF, PROV, SDO
def create_collection():
    collection = URIRef("https://w3id.org/hto/WorkCollection/ChapbooksOfScotland")
    graph.add((collection, RDF.type, hto.WorkCollection))
    graph.add((collection, hto.name, Literal("Chapbooks printed in Scotland Collection", datatype=XSD.string)))
    return collection

In [45]:
collection = create_collection()

In [46]:
from datetime import datetime
from rdflib import Literal, XSD, RDF, RDFS
from rdflib.namespace import FOAF, PROV, SDO

def series2rdf(series_info, graph, hto):

    # create the series node and its basic literal properties
    series = URIRef("https://w3id.org/hto/Series/"+str(series_info["MMSID"]))
    series_title= str(series_info["serieTitle"])
    graph.add((series, RDF.type, hto.Series))
    graph.add((collection, hto.hadMember, series))
    graph.add((series, hto.number, Literal(int(series_info["serieNum"]), datatype=XSD.integer)))
    graph.add((series, hto.title, Literal(series_title, datatype=XSD.string)))
    series_sub_title = str(series_info["serieSubTitle"])
    if series_sub_title != "0":
        graph.add((series, hto.subtitle, Literal(series_sub_title, datatype=XSD.string)))

    publish_year = str(series_info["year"])
    if publish_year != "0":
        graph.add((series, hto.yearPublished, Literal(publish_year, datatype=XSD.int)))
    # create a Location instance for printing place
    place_name = str(series_info["place"])
    if place_name != "0":
        place_uri_name = name_to_uri_name(place_name)
        place = URIRef("https://w3id.org/hto/Location/"+place_uri_name)
        graph.add((place, RDF.type, hto.Location))
        graph.add((place, RDFS.label, Literal(place_name, datatype=XSD.string)))
        graph.add((series, hto.printedAt, place))

    graph.add((series, hto.mmsid, Literal(str(series_info["MMSID"]), datatype=XSD.string)))
    graph.add((series, hto.physicalDescription, Literal(series_info["physicalDescription"], datatype=XSD.string)))
    graph.add((series, hto.genre, Literal(series_info["genre"], datatype=XSD.string)))
    graph.add((series, hto.language, Literal(series_info["language"], datatype=XSD.language)))

    # create a Location instance for shelf locator
    shelf_locator_name = str(series_info["shelfLocator"])
    shelf_locator_uri_name = name_to_uri_name(shelf_locator_name)
    shelf_locator = URIRef("https://w3id.org/hto/Location/"+shelf_locator_uri_name)
    graph.add((shelf_locator, RDF.type, hto.Location))
    graph.add((shelf_locator, RDFS.label, Literal(shelf_locator_name, datatype=XSD.string)))
    graph.add((series, hto.shelfLocator, shelf_locator))

    ## Editor
    editor_name = str(series_info["editor"]) if series_info["editor"] != 0 else ""
    editor_uri_name = name_to_uri_name(editor_name)
    # Only create the editor when the name yields a usable URI fragment;
    # otherwise `editor` would be undefined in the statements below.
    if editor_uri_name != "":
        editor = URIRef("https://w3id.org/hto/Person/" + editor_uri_name)
        graph.add((editor, RDF.type, hto.Person))
        graph.add((editor, FOAF.name, Literal(editor_name, datatype=XSD.string)))

        if series_info["editor_date"] != 0:
            editor_date = str(series_info["editor_date"]).replace("?", "")
            if "-" in editor_date:
                birth_date, death_date = editor_date.split("-")[:2]

                # The source only gives years, so they are typed as xsd:gYear;
                # a bare year is not a valid xsd:date lexical form.
                if birth_date.isnumeric():
                    graph.add((editor, SDO.birthDate, Literal(birth_date, datatype=XSD.gYear)))
                if death_date.isnumeric():
                    graph.add((editor, SDO.deathDate, Literal(death_date, datatype=XSD.gYear)))
            else:
                print(f"date {editor_date} cannot be parsed!")

        if series_info["termsOfAddress"] != 0:
            graph.add((editor, hto.termsOfAddress, Literal(series_info["termsOfAddress"], datatype=XSD.string)))

        graph.add((series, hto.editor, editor))

    #### Publisher persons

    # publisherPersons holds the person names obtained by running entity
    # recognition on the publisher field.
    publisherPersons = series_info["publisherPersons"]
    if publisherPersons != 0 and len(publisherPersons) > 0:
        print(publisherPersons)
        # Join with ", " so the name does not start with a stray separator.
        publisher_name = ", ".join(publisherPersons)
        iri_publisher_name = name_to_uri_name(publisher_name)
        # Skip the publisher when no usable URI fragment remains;
        # otherwise `publisher` would be undefined below.
        if iri_publisher_name != "":
            if len(publisherPersons) == 1:
                # A single name is modelled as a Person
                publisher = URIRef("https://w3id.org/hto/Person/" + iri_publisher_name)
                graph.add((publisher, RDF.type, hto.Person))
            else:
                # Several names are modelled as one Organization
                publisher = URIRef("https://w3id.org/hto/Organization/" + iri_publisher_name)
                graph.add((publisher, RDF.type, hto.Organization))

            graph.add((publisher, FOAF.name, Literal(publisher_name, datatype=XSD.string)))
            graph.add((series, hto.publisher, publisher))

        # Create an instance of publicationActivity
        # publication_activity = URIRef("https://w3id.org/hto/Activity/"+ "publication" + str(series_info["MMSID"]))
        # graph.add((publication_activity, RDF.type, PROV.Activity))
        # graph.add((publication_activity, PROV.generated, series))
        #  if publish_year != "0":
            # graph.add((publication_activity, PROV.endedAtTime, Literal(publish_year, datatype=XSD.dateTime)))
        # graph.add((publication_activity, PROV.wasEndedBy, publisher))

        #graph.add((series, PROV.wasGeneratedBy, publication_activity))

    #### Is Referenced by

    if series_info["referencedBy"] != 0:
        references=series_info["referencedBy"]
        for r in references:
            book_name = str(r)
            book_uri_name = name_to_uri_name(book_name)
            book = URIRef("https://w3id.org/hto/Book/"+book_uri_name)
            graph.add((book, RDF.type, hto.Book))
            graph.add((book, hto.name, Literal(book_name, datatype=XSD.string)))
            graph.add((series, hto.referencedBy, book))

    return series
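
The editor-date handling inside `series2rdf` can be read in isolation as a small parsing step: strip question marks, split on the hyphen, and keep only the parts that are plain numbers. The sketch below restates that logic with the standard library; `parse_life_dates` is a hypothetical helper name, and the inputs other than `1792-1871` and `1759-1796` are invented to show the edge cases.

```python
from typing import Optional, Tuple

def parse_life_dates(raw: str) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical helper: split 'birth-death' into two optional years."""
    # Uncertain dates are marked with '?' in the source; drop the marker.
    cleaned = raw.replace("?", "").strip()
    if "-" not in cleaned:
        return None, None
    birth, death = cleaned.split("-")[:2]
    # isdecimal() accepts only plain digits, so int() cannot fail here.
    birth_year = int(birth) if birth.isdecimal() else None
    death_year = int(death) if death.isdecimal() else None
    return birth_year, death_year

print(parse_life_dates("1792-1871"))   # (1792, 1871)
print(parse_life_dates("1759?-1796"))  # (1759, 1796)
print(parse_life_dates("1801-"))       # (1801, None)
print(parse_life_dates("fl. 1820"))    # (None, None)
```

Returning `None` for a missing half lets the caller decide which triples to add without printing from inside the parser.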

In [47]:
def volume2rdf(volume_info, series, graph, hto):
    volume_id=str(volume_info["volumeId"])
    volume = URIRef("https://w3id.org/hto/Volume/"+str(volume_info["MMSID"])+"_"+str(volume_id))
    graph.add((volume, RDF.type, hto.Volume))
    graph.add((volume, hto.number, Literal(volume_info["volumeNum"], datatype=XSD.integer)))
    graph.add((volume, hto.volumeId, Literal(volume_id, datatype=XSD.string)))
    graph.add((volume, hto.title, Literal(volume_info["volumeTitle"], datatype=XSD.string)))

    if volume_info["part"]!=0:
        graph.add((volume, hto.part, Literal(volume_info["part"], datatype=XSD.integer)))

    permanentURL = URIRef(str(volume_info["permanentURL"]))
    graph.add((permanentURL, RDF.type, hto.Location))
    graph.add((volume, hto.permanentURL, permanentURL))
    # graph.add((volume, hto.numberOfPages, Literal(volume_info["numberOfPages"], datatype=XSD.integer)))
    graph.add((series, RDF.type, hto.WorkCollection))
    graph.add((series, hto.hadMember, volume))
    graph.add((volume, hto.wasMemberOf, series))

    return volume

In [48]:
# create software uris
defoe = URIRef("https://github.com/defoe-code/defoe")
graph.add((defoe, RDF.type, hto.SoftwareAgent))
frances_information_extraction = URIRef("https://github.com/frances-ai/frances-InformationExtraction")
graph.add((frances_information_extraction, RDF.type, hto.SoftwareAgent))

<Graph identifier=N94280024c99b4179a395e89c0c8dc49e (<class 'rdflib.graph.Graph'>)>

In [49]:
def dataframe_to_rdf(dataframe, graph, hto, agent_uri, agent, chapbook_dataset):
    """Add series, volume, page and provenance triples for every row of the dataframe."""
    dataframe = dataframe.fillna(0)
    # create triples
    series_mmsids = dataframe["MMSID"].unique()
    for mmsid in series_mmsids:
        df_series = dataframe[dataframe["MMSID"] == mmsid].reset_index(drop=True)
        edition_info = df_series.loc[0]
        print(edition_info["serieTitle"])
        edition_ref = series2rdf(edition_info, graph, hto)

        # VOLUMES
        vol_numbers = df_series["volumeNum"].unique()
        # graph.add((edition_ref, hto.numberOfVolumes, Literal(len(vol_numbers), datatype=XSD.integer)))
        for vol_number in vol_numbers:
            df_vol = df_series[df_series["volumeNum"] == vol_number].reset_index(drop=True)
            volume_info = df_vol.loc[0]
            volume_ref = volume2rdf(volume_info, edition_ref, graph, hto)
            # print(volume_info)

            #### Pages
            for df_vol_index in range(0, len(df_vol)):
                df_page = df_vol.loc[df_vol_index]
                page_num = int(df_page["pageNum"])
                page_id = str(df_page["MMSID"])+"_"+str(df_page["volumeId"])+"_"+str(page_num)
                page_uri = URIRef("https://w3id.org/hto/Page/" + page_id)
                graph.add((page_uri, RDF.type, hto.Page))

                # Create original description for the page
                description = str(df_page["text"])
                if description != "":
                    page_original_description = URIRef("https://w3id.org/hto/OriginalDescription/" + page_id + agent)
                    graph.add((page_original_description, RDF.type, hto.OriginalDescription))
                    graph.add((page_original_description, hto.hasTextQuality, hto.Low))
                    graph.add((page_original_description, hto.text, Literal(description, datatype=XSD.string)))
                    graph.add((page_uri, hto.hasOriginalDescription, page_original_description))
                    graph.add((volume_ref,hto.hadMember, page_uri))
                    graph.add((page_original_description, PROV.wasAttributedTo, frances_information_extraction))

                    # Create source entity where original description was extracted
                    # source location
                    # source_path_name = df_entry["altoXML"]
                    # source_path_ref = URIRef("https://w3id.org/eb/Location/" + source_path_name)
                    # graph.add((source_path_ref, RDF.type, PROV.Location))
                    # source
                    source_name = df_page["altoXML"].replace("/", "_").replace(".", "_")
                    source_ref = URIRef("https://w3id.org/hto/InformationResource/" + source_name)
                    graph.add((source_ref, RDF.type, hto.InformationResource))
                    graph.add((chapbook_dataset, hto.hadMember, source_ref))
                    #graph.add((source_ref, PROV.atLocation, source_path_ref))
                    # related agent and activity
                    graph.add((source_ref, PROV.wasAttributedTo, agent_uri))
                    graph.add((source_ref, PROV.wasAttributedTo, defoe))


                    """
                    source_digitalising_activity = URIRef("https://w3id.org/eb/Activity/nls_digitalising_activity" + source_name)
                    graph.add((source_digitalising_activity, RDF.type, PROV.Activity))
                    graph.add((source_digitalising_activity, PROV.generated, source_ref))
                    graph.add((source_digitalising_activity, PROV.wasAssociatedWith, nls))
                    graph.add((source_ref, PROV.wasGeneratedBy, source_digitalising_activity))
                    """
                    graph.add((page_original_description, hto.wasExtractedFrom, source_ref))

    return graph
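
The URIs minted in `dataframe_to_rdf` and `volume2rdf` follow a fixed naming scheme built from the MMSID, the volume id, the page number, and the agent label. The sketch below collects those patterns in one place using plain string formatting; `page_uris` and `source_uri` are hypothetical helper names, and the input values are taken from the sample row in the column listing.

```python
BASE = "https://w3id.org/hto"

def page_uris(mmsid, volume_id, page_num, agent):
    """Hypothetical helper mirroring the URI patterns used for one page."""
    page_id = f"{mmsid}_{volume_id}_{page_num}"
    return {
        "volume": f"{BASE}/Volume/{mmsid}_{volume_id}",
        "page": f"{BASE}/Page/{page_id}",
        # The agent label is appended directly, without a separator.
        "description": f"{BASE}/OriginalDescription/{page_id}{agent}",
    }

def source_uri(alto_path):
    """Hypothetical helper: turn an ALTO file path into a URI-safe name."""
    name = alto_path.replace("/", "_").replace(".", "_")
    return f"{BASE}/InformationResource/{name}"

uris = page_uris(9937033633804341, 104184105, 1, "NLS")
print(uris["page"])
# https://w3id.org/hto/Page/9937033633804341_104184105_1
print(uris["description"])
# https://w3id.org/hto/OriginalDescription/9937033633804341_104184105_1NLS
print(source_uri("104184105/alto/107134030.34.xml"))
# https://w3id.org/hto/InformationResource/104184105_alto_107134030_34_xml
```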

In [50]:
from datetime import datetime
from rdflib import Literal, XSD, RDF
from rdflib.namespace import FOAF, PROV, SDO
# create organization NLS (National Library of Scotland)
def create_organization_nls(graph):
    nls = URIRef("https://w3id.org/hto/Organization/NLS")
    graph.add((nls, RDF.type, PROV.Organization))
    graph.add((nls, FOAF.name, Literal("National Library of Scotland", datatype=XSD.string)))
    return nls

In [51]:
def create_nls_chapbook_dataset(graph, nls):
    nls_chapbook_dataset = URIRef("https://w3id.org/hto/Collection/nckp_eb_text_dataset")
    graph.add((nls_chapbook_dataset, RDF.type, PROV.Collection))
    graph.add((nls_chapbook_dataset, PROV.wasAttributedTo, nls))

    # Create digitalising activity
    nls_digitalising_activity = URIRef("https://w3id.org/hto/Activity/nls_digitalising_activity")
    graph.add((nls_digitalising_activity, RDF.type, PROV.Activity))
    graph.add((nls_digitalising_activity, PROV.generated, nls_chapbook_dataset))
    graph.add((nls_digitalising_activity, PROV.wasAssociatedWith, nls))
    graph.add((nls_chapbook_dataset, PROV.wasGeneratedBy, nls_digitalising_activity))
    return nls_chapbook_dataset

In [52]:
agent = "NLS"
nls = create_organization_nls(graph)
nls_chapbook_dataset = create_nls_chapbook_dataset(graph, nls)
dataframe_to_rdf(chapbook_df, graph, hto, nls, agent, nls_chapbook_dataset)

song in praise of the highland lads
Two new songs viz
['J. Smith']
Report of the cattle show at Trearne
['J. Smith']
Laird of Ardenoaige and the Ghost of Fenhaglen
['Donald']
comical story of Thrummy Cap and the ghaist
old Scottish tragical ballad of Sir James the Rose
True and correct narrative of the dreadful burning of the steam-ship Amazon
['R. Crombie', 'G. Gordon', 'James Redger']
picture of war
['John Miller']
Tales for the farmers' ingle-neuk
['John Miller']
Elegy on the year eighty-eight
['David Willison', 'Craig', 'George Gray']
Melancholy loss of the whale-fishing ship Oscar, of  Aberdeen, on Thursday, April 1, 1813
emigrant; a poem
['J. Robertson']
Watty & Meg or the wife reformed. A tale
['T. Oliver', 'Fountain Well']
Mill o' Tiftie's Annie, or, Andrew Lammie, the trumpeter of Fyvie
['Clark', 'Son']
three crump twin brothers of Damascus
['J. Morren', 'Cowgate']
Interesting letter from Queen Caroline to King George IV
['Lawnmarket']
Wellington's address
Life of Robert Burns

<Graph identifier=N94280024c99b4179a395e89c0c8dc49e (<class 'rdflib.graph.Graph'>)>

In [53]:
from rdflib.plugins.sparql import prepareQuery
q1 = prepareQuery('''
  SELECT ?series ?title WHERE {
    ?series a hto:Series;
        hto:title ?title

  } LIMIT 25
  ''',
  initNs = { "hto": hto}
)

# Print the URI and title of each series returned by the query
for r in graph.query(q1):
    print(f"{r.series} {r.title}")

https://w3id.org/hto/Series/9937033633804340 song in praise of the highland lads
https://w3id.org/hto/Series/9937038123804340 Two new songs viz
https://w3id.org/hto/Series/9937038533804340 Report of the cattle show at Trearne
https://w3id.org/hto/Series/9936745783804340 Laird of Ardenoaige and the Ghost of Fenhaglen
https://w3id.org/hto/Series/9937038873804340 comical story of Thrummy Cap and the ghaist
https://w3id.org/hto/Series/9937039163804340 old Scottish tragical ballad of Sir James the Rose
https://w3id.org/hto/Series/9937039473804340 True and correct narrative of the dreadful burning of the steam-ship Amazon
https://w3id.org/hto/Series/9937039513804340 picture of war
https://w3id.org/hto/Series/9937039643804340 Tales for the farmers' ingle-neuk
https://w3id.org/hto/Series/9937039863804340 Elegy on the year eighty-eight
https://w3id.org/hto/Series/9937033803804340 Melancholy loss of the whale-fishing ship Oscar, of  Aberdeen, on Thursday, April 1, 1813
https://w3id.org/hto/Serie

In [54]:
# Save the Graph in the RDF Turtle format
graph.serialize(format="turtle", destination="../results/hto_chapbooks_nls.ttl")

<Graph identifier=N94280024c99b4179a395e89c0c8dc49e (<class 'rdflib.graph.Graph'>)>