# Extracting RDF graphs from TEI/XML documents using lxml.etree and RDFLib 

This Jupyter notebook is a step-by-step guide to the extraction of RDF graphs from TEI/XML documents as suggested by LIFT. LIFT is an open-source web application based entirely on Python. The aim of LIFT is to show and demonstrate how it is possible to extract RDF graphs, supported by widely adopted ontological vocabularies, from TEI/XML documents.

Before getting started with our RDF extraction, here is some useful documentation:  
**TEI/XML** <https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html>  
**lxml.etree** <https://lxml.de/tutorial.html>  
**RDFLib** <https://rdflib.readthedocs.io/en/stable/index.html>

## Setting up lxml

A way to process XML using Python is to leverage lxml.etree, which we can import as follows:

In [8]:
from lxml import etree

To read from a TEI/XML file (further on referred to as 'input' or 'TEI document'), we use the parse() function. Make sure to specify the correct path. In this case, the file 'input.xml' is stored in the current folder. For a basic introduction to paths see <https://www.w3schools.com/html/html_filepaths.asp>.

In [9]:
tree = etree.parse('input.xml')

lxml allows the retrieval of the root element of our input TEI/XML document through the getroot() function. We store the result in a variable that we conveniently called 'root'.  

In [10]:
root = tree.getroot()

Again, for convenience, we use the variables 'base_uri' and 'edition_id' to store the values of the xml:base and xml:id attributes assigned to the root <TEI> element in our TEI/XML input document. The attribute xml:base contains the base URI for the project (e.g. <http://example.org>), while the attribute xml:id features a unique ID for the TEI file. Note how to recall an attribute with an XML namespace, such as xml:id, using lxml: you need to include the actual namespace instead of the prefix. 

In [27]:
base_uri = root.get('{http://www.w3.org/XML/1998/namespace}base')
edition_uri = root.get('{http://www.w3.org/XML/1998/namespace}id')

We also want to bind the TEI namespace to the prefix 'tei' (we will use this later to refer to TEI elements) as follows:

In [1]:
tei = {'tei': 'http://www.tei-c.org/ns/1.0'}

## Bringing in RDFLib

A way to import RDFLib is as follows:

In [32]:
from rdflib import Graph, Literal, BNode, Namespace, URIRef

We now need to declare the namespaces of the ontological vocabularies that are going to provide semantics for our RDF graph.

 The following namespaces are available by direct import from RDFLib:

In [33]:
from rdflib.namespace import RDF, RDFS, XSD, DCTERMS, OWL

We declare any remaining namespace as follows:

In [None]:
agrelon = Namespace("https://d-nb.info/standards/elementset/agrelon#")
crm = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
frbroo = Namespace("http://iflastandards.info/ns/fr/frbr/frbroo/")
pro = Namespace("http://purl.org/spar/pro/")
proles = Namespace("http://www.essepuntato.it/2013/10/politicalroles/")
prov = Namespace("http://www.w3.org/ns/prov#")
schema = Namespace("https://schema.org/")
tvc = Namespace("http://www.essepuntato.it/2012/04/tvc/")

An RDFLib graph is a set of RDF triples. We declare our output graph and name it 'g':

In [34]:
g = Graph()

## Extracting RDF statements about persons

For each person in the TEI document, we:
1. Extract the person's xml:id
2. Build a unique URI for the person by concatenating the 'base_uri' from above with the person's xml:id. In order to make clear what kind of resource the URI represents, we also add the directory '/person/' before the actual person's xml:id.
3. We add our first triple to the RDF graph: the person's URI is the subject, followed by the predicate rdf:type, and the class schema:Person. By doing so, we assign the individual person to the ontological class schema:Person (<https://schema.org/Person>).

In [None]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    g.add( (person_uri, RDF.type, schema.Person))

Then, we look for a sameAs attribute associated with the person. This attribute should contain one or more URI pointing to authority records, such as VIAF, or to other resources about the same person, for example from DBpedia:
1. Using the get() function, we look for a sameAs attribute and split its contents by whitespace (line 4).
2. We loop through the list of URIs as many times as the total number of URIs stored in the sameAs attribute (lines 5-9), store the URIs in a variable 'same_as_uri' (line 7), then add a triple to the RDF graph at each loop (line 8): the person's URI is the subject, followed by the predicate owl:sameAs, and the retrieved sameAs URI. For example, if a sameAs attribute contains two URIs, two distinct RDF triples are added to the graph. 

In [None]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    same_as = person.get('sameAs').split()
    i = 0
    while i < len(same_as):
        same_as_uri = URIRef(same_as[i])
        g.add( (person_uri, OWL.sameAs, same_as_uri))
        i += 1

The next step is to provide each person's entity with a human-readable label, linked to the subject via the RDF property 'rdf:label'. In order to do so:
1. We iterate again through all persons looking for personal names, i.e. a child element <persName> (lines 1-4).
2. We then store the contents of such an element in the 'label' variable, as well as look for an xml:lang attribute associated with the label (line 6-7).
3. If an xml:lang is found, the script adds a triple featuring the person whose rdf:label is a Literal value to which a language declaration is attached (line 8).
4. Otherwise, the script creates a triple whithout declaring any specific language (line 10).

In [None]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    persname = person.find('./tei:persName', tei)
    label = persname.text 
    if persname.get('{http://www.w3.org/XML/1998/namespace}lang') is not None:
        label_lang = persname.get('{http://www.w3.org/XML/1998/namespace}lang')
        g.add( (person_uri, RDFS.label, Literal(label, lang=label_lang)))
    else:
        g.add( (person_uri, RDFS.label, Literal(label)))

The following code is divided into smaller functions, each performing a specific part of the extraction. The first function, subject(), 

In [9]:
def subject(person):
    #person_uri = URIRef(base_uri + '/person/' + person_id)
    g.add( (person_uri, RDF.type, schema.Person))

In [10]:
def sameas(person):    
    same_as = person.get('sameAs').split()
    i = 0
    while i < len(same_as):
        same_as_uri = URIRef(same_as[i])
        g.add( (person_uri, OWL.sameAs, same_as_uri))
        i += 1

In [11]:
def persname(person):
    persname = person.find('./tei:persName', tei)
    label = persname.text
    label_lang = persname.get('{http://www.w3.org/XML/1998/namespace}lang')
    if label_lang is not None:
        g.add( (person_uri, RDFS.label, Literal(label, lang=label_lang)))
    else:
        g.add( (person_uri, RDFS.label, Literal(label)))

In [12]:
def referenced_person(person_id):
    ref = './tei:text//tei:persName[@ref="#' + person_id + '"]'
    for referenced_person in root.findall(ref, tei):
        parent = referenced_person.getparent()
        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')
        parent_uri = URIRef(base_uri + '/text/' + parent_id)
        g.add( (person_uri, DCTERMS.isReferencedBy, parent_uri))
        g.add( (parent_uri, RDF.type, frbroo.F23_Expression_Fragment))
        g.add( (parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))

In [13]:
def perstype(person):
    listperson = person.find('./...', tei)
    perstype = listperson.get('type')
    perscorr = listperson.get('corresp')
    if perstype is not None:
        g.add( (person_uri, DCTERMS.description, Literal(perstype)))
    if perscorr is not None and perscorr.startswith('http'):
        g.add( (person_uri, DCTERMS.subject, URIRef(perscorr)))

In [14]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    person_ref = '#' + person_id
    subject(person)
    sameas(person)
    persname(person)
    referenced_person(person_id)
    perstype(person)

## Event

In [15]:
def partic_event(person):     
    for event in person.findall('./tei:event', tei):
        event_id = event.get('{http://www.w3.org/XML/1998/namespace}id')
        partic_event_uri = URIRef(base_uri + '/' + person_id + '-in-' + event_id)
        if event is not None:
            g.add( (person_uri, pro.holdsRoleInTime, partic_event_uri))

In [16]:
def role_in_event(person):
    for event in person.findall('./tei:event', tei):
        event_id = event.get('{http://www.w3.org/XML/1998/namespace}id')
        persName = person.find('./tei:persName', tei)
        label = persName.text
        rit_uri = URIRef(base_uri + '/rit/' + person_id + '-at-' + event_id)
        pers_in_event = event.find('./tei:desc/tei:persName', tei)
    
        g.add( (rit_uri, RDF.type, pro.RoleInTime))
        
        if pers_in_event is not None and pers_in_event.get('ref') == person_ref and pers_in_event.get('role') is not None:
            role_uri = URIRef(base_uri + '/role/' + pers_in_event.get('role'))
            g.add( (rit_uri, pro.withRole, role_uri))
            g.add( (role_uri, RDF.type, pro.Role))
        
        if pers_in_event.get('corresp') is not None:
            g.add( (role_uri, OWL.sameAs, pro.Role))
            g.add( (role_uri, RDFS.label, pro.Role))
            corresp_role_uri = URIRef(pers_in_event.get('corresp'))
            g.add( (role_uri, OWL.sameAs, corresp_role_uri))
            role_label = pers_in_event.get('role')
            g.add( (role_uri, RDFS.label, Literal(role_label)))
        else:
            g.add( (rit_uri, pro.withRole, URIRef(base_uri + '/role/participant')))
            role_uri = URIRef(base_uri + '/role/participant')
            g.add( (role_uri, RDF.type, pro.Role))
            g.add( (role_uri, OWL.sameAs, URIRef('http://wordnet-rdf.princeton.edu/id/10421528-n')))
            g.add( (role_uri, RDFS.label, Literal('participant'))) 

        g.add( (rit_uri, tvc.atTime, URIRef(base_uri + '/tvc/' + event_id + '-time')))
        g.add( (rit_uri, pro.relatesToEntity, URIRef(base_uri + '/event/' + event_id)))

        place = event.find('./tei:desc/tei:placeName', tei)
        if place > 1:
            place_of_event = place.get('type="place_of_event"')
            g.add( (rit_uri, proles.relatesToPlace, URIRef(base_uri + '/place/' + place.get('ref').replace("#", ""))))
        elif event.find('./tei:desc/tei:placeName', tei) == 1:
            g.add( (rit_uri, proles.relatesToPlace, URIRef(base_uri + '/place/' + place.get('ref').replace("#", ""))))       

In [17]:
def event_time():
    g.add( (event_time_uri, RDF.type, URIRef('http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#TimeInterval')))
    if event.get('when') is not None:
        g.add( (event_time_uri, OWL.hasIntervalStartDate, Literal(event.get('when'), datatype=XSD.date)))
        g.add( (event_time_uri, OWL.hasIntervalEndDate, Literal(event.get('when'), datatype=XSD.date)))
    if event.get('from') is not None:
        g.add( (event_time_uri, OWL.hasIntervalStartDate, Literal(event.get('from'), datatype=XSD.date)))
    if event.get('to') is not None:
        g.add( (event_time_uri, OWL.hasIntervalEndDate, Literal(event.get('to'), datatype=XSD.date)))

In [18]:
def event_desc():
    g.add( (event_uri, RDF.type, crm.E5_Event))
    g.add( (event_uri, RDF.type, schema.Event))
    if event.find('./tei:label', tei) is not None:
        label = event.find('./tei:label', tei).text
        g.add( (event_uri, RDFS.label, Literal(label)))
    if evtype is not None:
        g.add( (event_uri, DCTERMS.description, Literal(evtype)))
    if evcorr is not None and evcorr.startswith('http'):
        g.add( (event_uri, DCTERMS.subject, URIRef(evcorr)))

In [19]:
def event_source():
    source = event.find('./tei:bibl', tei)
    if source is not None:
        source_id = source.get('{http://www.w3.org/XML/1998/namespace}id')
        source_uri = URIRef(base_uri + '/source/' + source_id)
        g.add( (event_uri, prov.hasPrimarySource, source_uri))
        for event_source in root.findall('.//tei:event//tei:bibl', tei):
            g.add( (source_uri, RDF.type, prov.PrimarySource))
            if event_source.find('./tei:author', tei) is not None and event_source.find('./tei:author', tei).get('ref') is not None:
                author_ref = event_source.find('./tei:author', tei).get('ref')
                author_id = author_ref.split('#')
                g.add( (source_uri, DCTERMS.creator, URIRef(base_uri + '/person/' + author_id[1])))
            if event_source.find('.tei:title', tei) is not None:
                g.add( (source_uri, DCTERMS.title, Literal(event_source.find('.tei:title', tei).text)))
            if event_source.get('sameAs') is not None:
                sameAs = event_source.get('sameAs')
                if sameAs.startswith('http'):
                    g.add( (source_uri, OWL.sameAs, URIRef(event_source.get('sameAs')))) 
            if event_source.find('.tei:date', tei) is not None:
                evdate = event_source.find('.tei:date', tei)
                g.add( (source_uri, DCTERMS.date, Literal(evdate.get('when'), datatype=XSD.date)))

Call functions

In [20]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    person_ref = '#' + person_id
    partic_event(person)
    role_in_event(person)

In [21]:
for event in root.findall('.//tei:event', tei):
    event_id = event.get('{http://www.w3.org/XML/1998/namespace}id')
    event_time_uri = URIRef(base_uri + '/' + event_id + '-time')
    event_uri = URIRef(base_uri + '/event/' + event_id)
    evcorr = event.get('corresp')
    evtype = event.get('type')
    event_time()
    event_desc()
    event_source()

## Relation

In [22]:
def relation(person):
    for relation in root.findall('.//tei:listRelation/tei:relation', tei):
        person_ref = '#' + person_id
        if relation.get('active') is not None and relation.get('active') == person_ref:
            passive = relation.get('passive').replace("#", "").split()
            i = 0
            while i < len(passive):
                g.add( (person_uri, agrelon[relation.get('name')], URIRef(base_uri + '/' + passive[i])))
                i += 1
        elif relation.get('mutual') is not None:
            relentity = relation.get('mutual').split()
            if person_ref in relentity:
                mutual = relation.get('mutual').replace("#", "").replace(person_id, "").split()
                i = 0
                while i < len(mutual):
                    g.add( (person_uri, agrelon[relation.get('name')], URIRef(base_uri + '/' + mutual[i])))
                    i += 1

In [23]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    person_ref = '#' + person_id
    relation(person)

## Place

In [24]:
def place_subject(place):
    g.add( (place_uri, RDF.type, schema.Place))

In [25]:
def place_sameas(place):
    same_as = place.get('sameAs').split()
    i = 0
    while i < len(same_as):
        same_as_uri = URIRef(same_as[i])
        g.add( (place_uri, OWL.sameAs, same_as_uri))
        i += 1

In [26]:
def placename(place):
    placename = place.find('./tei:placeName', tei)
    label = placename.text
    label_lang = placename.get('{http://www.w3.org/XML/1998/namespace}lang')
    if label_lang is not None:
        g.add( (place_uri, RDFS.label, Literal(label, lang=label_lang)))
    else:
        g.add( (place_uri, RDFS.label, Literal(label)))

In [27]:
def referenced_place(place_id):
    ref = './/tei:placeName[@ref="#' + place_id + '"]'
    for referenced_place in root.findall(ref, tei):
        parent = referenced_place.getparent()
        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')
        parent_uri = URIRef(base_uri + '/text/' + parent_id)
        g.add( (place_uri, DCTERMS.isReferencedBy, parent_uri))
        g.add( (parent_uri, RDF.type, frbroo.F23_Expression_Fragment))
        g.add( (parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))

Call functions

In [28]:
for place in root.findall('.//tei:place', tei):
    place_id = place.get('{http://www.w3.org/XML/1998/namespace}id')
    place_uri = URIRef(base_uri + '/place/' + place_id)
    place_ref = '#' + place_id
    place_subject(place)
    place_sameas(place)
    placename(place)
    referenced_place(place_id)

In [29]:
# bind prefix
g.bind("agrelon", agrelon)
g.bind("crm", crm)
g.bind("frbroo", frbroo)
g.bind("dcterms", DCTERMS)
g.bind("schema", schema)
g.bind("owl", OWL)
g.bind("pro", pro)
g.bind("proles", proles)
g.bind("prov", prov)
g.bind("tvc", tvc)

In [30]:
print g.serialize(format='n3')

@prefix agrelon: <https://d-nb.info/standards/elementset/agrelon#> .
@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix frbroo: <http://iflastandards.info/ns/fr/frbr/frbroo/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix pro: <http://purl.org/spar/pro/> .
@prefix proles: <http://www.essepuntato.it/2013/10/politicalroles/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix tvc: <http://www.essepuntato.it/2012/04/tvc/> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/ev01-time> a <http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#TimeInterval> ;
    owl:hasIntervalEndDate "-0399"^^xsd:date ;
    owl:hasIntervalStartDate "-0399"^^xsd:date .

<http://example.org/ev02-time> a

In [31]:
g.serialize(destination="output.xml", format='xml')