# Ontology for the metadata

This notebook creates the ontology for the metadata of the Judaicalink project. The ontology is expressed in RDF and describes the metadata of the books and journals in the project. It is built with the RDFLib library in Python.

Important namespaces are:
GNDO - https://d-nb.info/standards/elementset/gnd#
Judaicalink - http://data.judaicalink.org/ontology/

The generator is based on the ontology builder of the Judaicalink project:
https://github.com/judaicalink/judaicalink-ontology


In [2]:
rdf_data= """
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix gndo: <https://d-nb.info/standards/elementset/gnd#> .
@prefix jl: <https://ontology.judaicalink.org/> .
@prefix cm: <https://sammlungen.ub.uni-frankfurt.de/cm/> .

# Define the class Journal
cm:Journal a owl:Class ;
    rdfs:label "Journal" ;
    rdfs:subClassOf gndo:Periodical .
    
cm:YearFrom a owl:DatatypeProperty ;
    rdfs:label "Year From" ;
    rdfs:domain cm:Journal ;
    rdfs:range jl:gYear .
    
cm:YearTo a owl:DatatypeProperty ;
    rdfs:label "Year To" ;
    rdfs:domain cm:Journal ;
    rdfs:range jl:gYear .
    
# Define properties of Journal
cm:title a owl:DatatypeProperty ;
    rdfs:label "Title" ;
    rdfs:domain cm:Journal .

cm:governingBody a owl:ObjectProperty ;
    rdfs:label "Governing Body" ;
    rdfs:domain cm:Journal ;
    rdfs:range gndo:CorporateBody .

cm:description a owl:DatatypeProperty ;
    rdfs:label "Description" ;
    rdfs:domain cm:Journal .

cm:language a owl:DatatypeProperty ;
    rdfs:label "Language" ;
    rdfs:domain cm:Journal ;
    rdfs:range jl:string .

cm:onlineEdition a owl:DatatypeProperty ;
    rdfs:label "Online Edition" ;
    rdfs:domain cm:Journal ;
    rdfs:range jl:string .

cm:URN a owl:DatatypeProperty ;
    rdfs:label "URN" ;
    rdfs:domain cm:Journal ;
    rdfs:range jl:string .

# Define subclass Publication
cm:Publication a owl:Class ;
    rdfs:label "Publication" .

# Define properties of Publication
cm:place a owl:DatatypeProperty ;
    rdfs:label "Place" ;
    rdfs:domain cm:Publication ;
    rdfs:range jl:string .

cm:publisher a owl:DatatypeProperty ;
    rdfs:label "Publisher" ;
    rdfs:domain cm:Publication ;
    rdfs:range jl:string .

cm:printingPress a owl:DatatypeProperty ;
    rdfs:label "Printing Press" ;
    rdfs:domain cm:Publication ;
    rdfs:range jl:string .

cm:publicationDate a owl:DatatypeProperty ;
    rdfs:label "Publication Date" ;
    rdfs:domain cm:Publication ;
    rdfs:range jl:date .

# Define subclass Volume
cm:Volume a owl:Class ;
    rdfs:label "Volume" .

# Define properties of Volume
cm:hasIssue a owl:ObjectProperty ;
    rdfs:label "Has Issue" ;
    rdfs:domain cm:Volume ;
    rdfs:range cm:Issue .

# Define subclass Issue
cm:Issue a owl:Class ;
    rdfs:label "Issue" .

# Define properties of Issue
cm:issueNumber a owl:DatatypeProperty ;
    rdfs:label "Issue Number" ;
    rdfs:domain cm:Issue ;
    rdfs:range jl:integer .

cm:issueYear a owl:DatatypeProperty ;
    rdfs:label "Issue Year" ;
    rdfs:domain cm:Issue ;
    rdfs:range jl:gYear .
    
cm:issueDate a owl:DatatypeProperty ;
    rdfs:label "Issue Date" ;
    rdfs:domain cm:Issue ;
    rdfs:range jl:date .

cm:issueLink a owl:DatatypeProperty ;
    rdfs:label "Issue Link" ;
    rdfs:domain cm:Issue ;
    rdfs:range jl:anyURI .

cm:digitalizationType a owl:DatatypeProperty ;
    rdfs:label "Digitalization Type" ;
    rdfs:domain cm:Issue ;
    rdfs:range jl:string .

# Define additional properties for Journal
cm:hasVolume a owl:ObjectProperty ;
    rdfs:label "Has Volume" ;
    rdfs:domain cm:Journal ;
    rdfs:range cm:Volume .

cm:hasTitlePage a owl:ObjectProperty ;
    rdfs:label "Has Title Page" ;
    rdfs:domain cm:Journal ;
    rdfs:range cm:TitlePage .

cm:hasTableOfContents a owl:ObjectProperty ;
    rdfs:label "Has Table of Contents" ;
    rdfs:domain cm:Journal ;
    rdfs:range cm:TableOfContents .

cm:alternativeTitle a owl:DatatypeProperty ;
    rdfs:label "Alternative Title" ;
    rdfs:domain cm:Journal ;
    rdfs:range jl:string .

cm:corpusBelongsTo a owl:ObjectProperty ;
    rdfs:label "Corpus Belongs To" ;
    rdfs:domain cm:Journal ;
    rdfs:range gndo:Corpus .

cm:continuedUnder a owl:ObjectProperty ;
    rdfs:label "Continued Under" ;
    rdfs:domain cm:Journal ;
    rdfs:range cm:Journal .

cm:linkedJournal a owl:ObjectProperty ;
    rdfs:label "Linked Journal" ;
    rdfs:domain cm:Journal ;
    rdfs:range cm:Journal .

# Define subclass TitlePage
cm:TitlePage a owl:Class ;
    rdfs:label "Title Page" .

# Define properties of TitlePage
cm:titleDescription a owl:DatatypeProperty ;
    rdfs:label "Title Description" ;
    rdfs:domain cm:TitlePage ;
    rdfs:range jl:string .

cm:titleLink a owl:DatatypeProperty ;
    rdfs:label "Title Link" ;
    rdfs:domain cm:TitlePage ;
    rdfs:range jl:anyURI .

# Define subclass TableOfContents
cm:TableOfContents a owl:Class ;
    rdfs:label "Table of Contents" .

# Define properties of TableOfContents
cm:contentsTitle a owl:DatatypeProperty ;
    rdfs:label "Contents Title" ;
    rdfs:domain cm:TableOfContents ;
    rdfs:range jl:string .

cm:contentsLink a owl:DatatypeProperty ;
    rdfs:label "Contents Link" ;
    rdfs:domain cm:TableOfContents ;
    rdfs:range jl:anyURI .
"""

The ontology is saved in the file ontology.ttl.

# Save the Ontology

In [7]:
from rdflib import Graph

# Create an empty RDF graph
graph = Graph()

graph.parse(data=rdf_data, format="turtle")

# Define the output file path
output_file_path = 'ontology.ttl'

# Serialize the graph into Turtle format and write it to a file
with open(output_file_path, 'w') as output_file:
    output_file.write(graph.serialize(format='turtle'))

print("Ontology Turtle file created successfully at:", output_file_path)

Ontology Turtle file created successfully at: ontology.ttl


# Journals in Compact Memory


### Save the data to the graph

In [40]:
from urllib.parse import quote

import pandas as pd
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS, XSD

# NOTE: the DataFrame `df` must already be loaded at this point. It needs the
# columns 'Zs_Caption', 'Volume_Caption', 'Heft_Caption' and 'Datum'.
required_columns = ['Zs_Caption', 'Volume_Caption', 'Heft_Caption', 'Datum']

# Replace NaN values with empty strings
df = df.fillna('')

# Filter out rows with empty strings in the required columns
df = df[(df[required_columns] != '').all(axis=1)]

# Create an RDF graph
g = Graph()

# Define namespaces; the trailing "/" keeps the namespace separate from the local name
gndo = Namespace("https://d-nb.info/standards/elementset/gnd#")
cm = Namespace("https://data.judaicalink.org/ontology/cm/")

# Iterate over each row and add triples to the graph.
# Rows with missing values were already removed above, so no NaN check is needed here.
for idx, row in df.iterrows():
    # Create URIs for the journal, volume, and issue; the captions are
    # percent-encoded so that spaces and special characters yield valid URIs
    journal_uri = cm[quote(str(row['Zs_Caption']))]
    volume_uri = cm[quote(str(row['Volume_Caption']))]
    issue_uri = cm[quote(str(row['Heft_Caption']))]

    # Add triples for the journal, volume, and issue
    g.add((journal_uri, RDF.type, gndo.Periodical))
    g.add((journal_uri, RDFS.label, Literal(row['Zs_Caption'], datatype=XSD.string)))

    g.add((volume_uri, RDF.type, gndo.Volume))
    g.add((volume_uri, RDFS.label, Literal(row['Volume_Caption'], datatype=XSD.string)))
    g.add((volume_uri, gndo.partOf, journal_uri))

    g.add((issue_uri, RDF.type, gndo.Issue))
    g.add((issue_uri, RDFS.label, Literal(row['Heft_Caption'], datatype=XSD.string)))
    g.add((issue_uri, gndo.partOf, volume_uri))
    g.add((issue_uri, gndo.publicationDate, Literal(row['Datum'], datatype=XSD.date)))

# Save the graph as a Turtle file
g.serialize(destination='output.ttl', format='turtle')
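
The loop above derives each URI from a single caption, so two journals that both have a volume with the same caption (for example a year) would end up sharing one volume URI. One possible way around this, shown here only as a sketch, is to nest the captions hierarchically and percent-encode each part. The helper name `mint_uri`, the base URI `https://example.org/cm/` and the captions "Journal A" and "Journal B" are assumptions made for illustration; they do not come from the project.

```python
from urllib.parse import quote

# Hypothetical base URI for illustration; not taken from the project configuration.
BASE = "https://example.org/cm/"


def mint_uri(base: str, *captions: str) -> str:
    """Join percent-encoded captions into one hierarchical URI string."""
    # safe="" also encodes "/", so a caption can never add a path level of its own.
    return base + "/".join(quote(caption.strip(), safe="") for caption in captions)


# Hypothetical captions: two journals that each have a volume captioned "1936".
journal = mint_uri(BASE, "Journal A")
volume_a = mint_uri(BASE, "Journal A", "1936")
volume_b = mint_uri(BASE, "Journal B", "1936")

print(journal)               # https://example.org/cm/Journal%20A
print(volume_a)              # https://example.org/cm/Journal%20A/1936
print(volume_a != volume_b)  # True
```

The resulting strings could be wrapped in `URIRef` before being added to the graph. Whether nested URIs fit the project's URI scheme is a design decision this sketch does not settle.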


## Load the Journal Metadata from the JSON file

In [None]:
import json

with open("metadata/journal_metadata/journal_metadata_title_lang.json") as f:
    data = json.load(f)

# Build the Graph

In [None]:
from rdflib import Graph, Literal, RDF, URIRef, Namespace
from rdflib.namespace import DC, FOAF

# Create the graph
g = Graph()

# Create the namespaces


# Load CM_Knoten_MODS_XML_15.06.2020
This file contains two columns: OC_OT_ID and OC_XML. OC_OT_ID is the identifier of the object in the database, and OC_XML is the XML (MODS) representation of the object.

This page links to "CM_Seiten_Metadaten_15.06.2020.xlsx".


In [25]:
import pandas as pd

# Load the tab-separated file
df_cm_knots = pd.read_csv("metadata/CM_Knoten_MODS_XML_15.06.2020", sep="\t")
df_cm_knots.head()

   OC_OT_ID                                             OC_XML
0   6727495  <mods xmlns="http://www.loc.gov/mods/v3" versi...
1   9981088  <mods xmlns="http://www.loc.gov/mods/v3" versi...
2   6074819  <mods xmlns="http://www.loc.gov/mods/v3" versi...
3   9594165  <mods xmlns="http://www.loc.gov/mods/v3" versi...
4   9574181  <mods xmlns="http://www.loc.gov/mods/v3" versi...


In [34]:
# Iterate over the first rows and print the ID and the raw XML
import xml.etree.ElementTree as ET

for idx, row in df_cm_knots.head(11).iterrows():
    print(row['OC_OT_ID'])
    print(row['OC_XML'])
    # Parsing the XML string (not yet enabled):
    # root = ET.fromstring(row['OC_XML'])
    # for child in root:
    #     print(child.tag, child.attrib)
    #     for sub_child in child:
    #         print(sub_child.tag, sub_child.text)

6727495
<mods xmlns="http://www.loc.gov/mods/v3" version="3.5"><part><date encoding="w3cdtf">1936</date><detail><caption>1936</caption></detail></part></mods>
9981088
<mods xmlns="http://www.loc.gov/mods/v3" version="3.5"><typeOfResource>text</typeOfResource><part><detail><number>21</number></detail><date encoding="w3cdtf">1923</date></part></mods>
6074819
<mods xmlns="http://www.loc.gov/mods/v3" version="3.5"><typeOfResource>text</typeOfResource><genre authority="marcgt">issue</genre><part><detail><caption>Inhalts-Verzeichnis</caption></detail><date encoding="w3cdtf">1930</date></part></mods>
9594165
<mods xmlns="http://www.loc.gov/mods/v3" version="3.5"><typeOfResource>text</typeOfResource><genre authority="marcgt">issue</genre><part><detail><caption>Erste Zugabe des zweyten Jahrgangs zu der hebr&#228;ischen Monatsschrift (&#1492;&#1502;&#1488;&#1505;&#1507;) dem Sammler. (December 1784)</caption></detail><date encoding="w3cdtf">1784-12</date></part></mods>
9574181
<mods xmlns="http:
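
The records above all use the MODS namespace `http://www.loc.gov/mods/v3`, so element lookups with `xml.etree.ElementTree` need a namespace mapping. The following is a minimal sketch of how single fields could be pulled out of one record. The sample string is copied from the output above; the helper name `parse_mods` and the keys of the returned dictionary are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <mods> root element of every record shown above.
NS = {"mods": "http://www.loc.gov/mods/v3"}

# One record copied from the notebook output.
sample_xml = (
    '<mods xmlns="http://www.loc.gov/mods/v3" version="3.5">'
    '<typeOfResource>text</typeOfResource>'
    '<genre authority="marcgt">issue</genre>'
    '<part><detail><caption>Inhalts-Verzeichnis</caption></detail>'
    '<date encoding="w3cdtf">1930</date></part></mods>'
)


def parse_mods(xml_string: str) -> dict:
    """Extract a few fields from one MODS record; missing elements become None."""
    root = ET.fromstring(xml_string)
    return {
        "genre": root.findtext("mods:genre", namespaces=NS),
        "caption": root.findtext("mods:part/mods:detail/mods:caption", namespaces=NS),
        "number": root.findtext("mods:part/mods:detail/mods:number", namespaces=NS),
        "date": root.findtext("mods:part/mods:date", namespaces=NS),
    }


record = parse_mods(sample_xml)
print(record)
# {'genre': 'issue', 'caption': 'Inhalts-Verzeichnis', 'number': None, 'date': '1930'}
```

Not every record carries every element (the first one above has no genre and no number), which is why the sketch returns `None` for missing fields instead of raising an error.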

# Load filename to meta

File: filename_to_meta.json
This file maps each filename to the metadata of that file.
It is a dictionary with the filename as the key and a metadata dictionary as the value.
The metadata dictionary has the following keys:
    y - Year
    j_id - Journal ID
    jn - Journal Name
    pdf - PDF (Boolean)
    vlid - Volume ID

The entries can be matched to the corresponding metadata in the CM_Seiten_Metadata.csv file.
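
To make the structure concrete, the sketch below groups the filenames by journal ID using the keys listed above. The single sample entry mirrors the first entry of the real file as printed further down in this notebook; the function name `group_by_journal` and the shape of its result are assumptions for illustration only.

```python
from collections import defaultdict

# One entry in the documented shape; the real file contains many such entries.
filename_to_meta = {
    "52596_9443938__02_52598": {
        "y": "1835",
        "j_id": "52596",
        "jn": (
            "... Bericht ueber den Verein für die Provinz Westfalen zur Bildung "
            "von Elementar-Lehrern und Befoerderung von Handwerken und Kuensten "
            "unter den Juden"
        ),
        "pdf": True,
        "vlid": "52598",
    },
}


def group_by_journal(meta_by_filename: dict) -> dict:
    """Return {journal ID: {"name": journal name, "filenames": [...]}}."""
    journals = defaultdict(lambda: {"name": None, "filenames": []})
    for filename, meta in meta_by_filename.items():
        entry = journals[meta["j_id"]]
        entry["name"] = meta["jn"]
        entry["filenames"].append(filename)
    return dict(journals)


journals = group_by_journal(filename_to_meta)
print(journals["52596"]["filenames"])  # ['52596_9443938__02_52598']
```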

In [41]:
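import json

with open("metadata/short_metadata/filename_to_meta.json") as f:
    data = json.load(f)

journal_titles_from_meta = []
# Inspect only the first entry: print its key and its metadata values
for key in data:
    print(key)
    print(data[key].values())
    journal_titles_from_meta.append(data[key]['jn'])
    break

print(journal_titles_from_meta)

# Collect the journal names of all entries for the comparison below
all_journal_titles_from_meta = {meta['jn'] for meta in data.values()}

# Compare the titles from the metadata with the titles from the CM_Seiten_Metadata.csv file.
# NOTE: `df` must be the page-metadata DataFrame with a 'Zs_Caption' column;
# otherwise this lookup raises KeyError: 'Zs_Caption'.
journal_titles = df['Zs_Caption'].unique()
for title in journal_titles:
    if title not in all_journal_titles_from_meta:
        print(title)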

52596_9443938__02_52598
dict_values(['1835', '52596', '... Bericht ueber den Verein für die Provinz Westfalen zur Bildung von Elementar-Lehrern und Befoerderung von Handwerken und Kuensten unter den Juden', True, '52598'])
['... Bericht ueber den Verein für die Provinz Westfalen zur Bildung von Elementar-Lehrern und Befoerderung von Handwerken und Kuensten unter den Juden']


KeyError: 'Zs_Caption'

# Load pages by year
This file lists all pages by year. It is a dictionary with the year as the key and the list of page identifiers as the value.

The page identifiers have the following format:

Example: 2577043_2577044_2577045__0890_2577534

In CM_Seiten_Metadata.csv, the pages are represented in the column 'OT_PATH', which uses "||" instead of "_" as the separator:

OT_PATH --> 2577043||2577044||2577045||2577534

The segment __0890_ has no counterpart in the OT_PATH, and its meaning is unclear.

It therefore seems that the data is not correct, or at least that the two sources cannot be matched directly.
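
The relationship between the two identifier forms can be sketched as a small conversion function. It assumes the layout seen in the example, `<id>_<id>_..._<id>__<token>_<id>`, and simply drops the token after the double underscore, because its meaning is not known. The function name `page_id_to_ot_path` is hypothetical.

```python
def page_id_to_ot_path(page_id: str) -> str:
    """Convert a page identifier into the "||"-separated OT_PATH form.

    Assumes the layout <id>_<id>_..._<id>__<token>_<id>, where the token
    after the double underscore has no counterpart in OT_PATH and is dropped.
    """
    head, separator, tail = page_id.partition("__")
    if not separator:
        raise ValueError(f"no '__' separator in {page_id!r}")
    _token, _, last_id = tail.partition("_")
    if not last_id:
        raise ValueError(f"no trailing ID in {page_id!r}")
    return "||".join(head.split("_") + [last_id])


print(page_id_to_ot_path("2577043_2577044_2577045__0890_2577534"))
# prints 2577043||2577044||2577045||2577534
```

A converted identifier could then be compared with the 'OT_PATH' column to check how many pages actually match between the two sources.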

# Load the data from the JSON file

In [42]:
import json

with open("metadata/short_metadata/pages_by_year.json") as f:
    data = json.load(f)

# iterate over all the objects and print their keys
for key in data:
    print(key)
    print(data[key])
    break

1760
['2577043_2577044_2577045__0890_2577534', '2577043_2577044_2577045__0891_2577535', '2577043_2577044_2577045__0892_2577536', '2577043_2577044_2577045__0893_2577537', '2577043_2577044_2577045__0894_2577538', '2577043_2577044_2577045__0895_2577539', '2577043_2577044_2577045__0896_2577540', '2577043_2577044_2577045__0897_2577541', '2577043_2577044_2577045__0898_2577542', '2577043_2577044_2577045__0899_2577543', '2577043_2577044_2577045__0900_2577544', '2577043_2577044_2577045__0901_2577545', '2577043_2577044_2577045__0902_2577546', '2577043_2577047_2577048__0003_2577078', '2577043_2577047_2577048__0004_2577079', '2577043_2577047_2577048__0005_2577080', '2577043_2577047_2577048__0006_2577081', '2577043_2577047_2577048__0007_2577082', '2577043_2577047_2577048__0008_2577083', '2577043_2577047_2577048__0009_2577084', '2577043_2577047_2577048__0010_2577085', '2577043_2577047_2577048__0011_2577086', '2577043_2577047_2577048__0012_2577087', '2577043_2577047_2577048__0013_2577088', '2577043_2