# Extracting RDF graphs from TEI/XML documents using lxml.etree and RDFLib 

## 1. Introduction

This Jupyter notebook is a step-by-step guide to the extraction of RDF graphs from TEI/XML documents using lxml and RDFLib, as suggested by LIFT.   
LIFT is an open-source web application based entirely on Python. The aim of LIFT is to demonstrate how RDF graphs, supported by widely adopted ontological vocabularies, can be extracted from TEI/XML documents.  
This notebook will show you how to leverage the lxml.etree library to parse TEI/XML documents and the RDFLib library to build RDF statements using the information extracted from the TEI input file.

 
**TEI/XML** - the standard vocabulary for textual encoding in the humanities <https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html>  
**lxml.etree** - a Python library for XML processing  
**RDFLib** - a Python library for working with RDF <https://rdflib.readthedocs.io/en/stable/index.html>

## 2. Installing lxml and RDFLib

Firstly, if you do not already have it, install lxml onto your computer by following the instructions provided at this link: <https://lxml.de/installation.html>.  
  
Do the same for RDFLib. Information on how to install the library is available at <https://rdflib.readthedocs.io/en/stable/gettingstarted.html>.

## 3. Getting started with the TEI to RDF extraction script 

The following blocks of code are ideally stored in a single Python file, which you can create and name something like `TEItoRDF.py`. Alternatively, you can download this Jupyter notebook as a Python file by clicking on File > Download as > Python (.py). Let's go!

### 3.1 Importing lxml

Starting with an empty Python file, we begin by importing lxml.etree (a library for processing XML using Python, cf. section 1) into our script:

In [20]:
from lxml import etree

To read from a TEI/XML file (further on referred to as 'input' or 'TEI document'), we use the `parse()` function:

In [21]:
tree = etree.parse('input.xml')

Make sure to specify the correct path. In this case, the file `input.xml` is stored in the current folder. For a basic introduction to paths see <https://www.w3schools.com/html/html_filepaths.asp>.

In order to retrieve the root element of the TEI document (i.e. `input.xml`), we use the function `getroot()` and store the result in the 'root' variable:

In [22]:
root = tree.getroot()

We also assign the values of the TEI attributes `@xml:base` and `@xml:id`, which are attached to the root element of the TEI document, to the variables 'base_uri' and 'edition_id' respectively. These will come in handy when generating entity URIs.  
In order to retrieve the attributes we leverage the `get()` function. Note how we substituted the prefix 'xml' with the actual namespace URI in curly braces: this is the canonical way of working with attributes belonging to the xml namespace in lxml.

In [23]:
base_uri = root.get('{http://www.w3.org/XML/1998/namespace}base')
edition_id = root.get('{http://www.w3.org/XML/1998/namespace}id')
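
The namespaced attribute key is easy to mistype, so one option is to keep the XML namespace in a constant. The following is a minimal sketch, assuming a small inline TEI snippet in place of `input.xml`; the names `XML_NS`, `xml_attr()` and the `sample_*` variables are illustrative and not part of lxml or LIFT.

```python
from lxml import etree

# Namespace that the reserved 'xml' prefix always stands for
XML_NS = 'http://www.w3.org/XML/1998/namespace'

def xml_attr(element, name):
    """Return the value of xml:<name> on element, or None if it is absent."""
    return element.get('{' + XML_NS + '}' + name)

# Hypothetical stand-in for the root element of input.xml
sample = b'''<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xml:base="http://example.org" xml:id="edition1"/>'''
sample_root = etree.fromstring(sample)

sample_base_uri = xml_attr(sample_root, 'base')
sample_edition_id = xml_attr(sample_root, 'id')
```

Because `get()` returns `None` for a missing attribute, it may be worth checking both values before concatenating them into URIs.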

We then bind the TEI namespace to the prefix 'tei' (we will use this later to refer to TEI elements) as follows:

In [24]:
tei = {'tei': 'http://www.tei-c.org/ns/1.0'}
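
To see what the prefix map does, the sketch below runs the same search with and without it. It assumes a hypothetical inline TEI fragment; `tei_ns` plays the role of the `tei` dictionary above, and the other names are illustrative.

```python
from lxml import etree

tei_ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Hypothetical TEI fragment with two persons
sample = b'''<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person><persName>Socrates</persName></person>
    <person><persName>Plato</persName></person>
  </listPerson>
</TEI>'''
sample_root = etree.fromstring(sample)

# With the prefix map, 'tei:person' matches elements in the TEI namespace
with_prefix = sample_root.findall('.//tei:person', tei_ns)

# Without it, the bare tag name matches nothing, because every element
# in the sample lives in the TEI namespace
without_prefix = sample_root.findall('.//person')

names = [p.find('./tei:persName', tei_ns).text for p in with_prefix]
```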

### 3.2 Importing RDFLib

Firstly, we import the Graph, Literal, BNode, Namespace and URIRef classes from RDFLib as follows:

In [25]:
from rdflib import Graph, Literal, BNode, Namespace, URIRef

Secondly, we declare the namespaces of the ontological vocabularies that are going to provide the semantics of the resulting RDF graph.
Some namespaces are available by direct import from RDFLib so we can simply type:

In [26]:
from rdflib.namespace import RDF, RDFS, XSD, DCTERMS, OWL

Any other namespace is to be declared in the following way (these are the ontologies used in LIFT):

In [27]:
agrelon = Namespace("https://d-nb.info/standards/elementset/agrelon#")
crm = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
frbroo = Namespace("http://iflastandards.info/ns/fr/frbr/frbroo/")
pro = Namespace("http://purl.org/spar/pro/")
proles = Namespace("http://www.essepuntato.it/2013/10/politicalroles/")
prov = Namespace("http://www.w3.org/ns/prov#")
schema = Namespace("https://schema.org/")
tvc = Namespace("http://www.essepuntato.it/2012/04/tvc/")

An RDFLib graph is a set of RDF triples. We declare our output graph and name it 'g':

In [28]:
g = Graph()

Using the function bind(), we bind each of our namespaces to a prefix:

In [29]:
g.bind("agrelon", agrelon)
g.bind("crm", crm)
g.bind("frbroo", frbroo)
g.bind("dcterms", DCTERMS)
g.bind("schema", schema)
g.bind("owl", OWL)
g.bind("pro", pro)
g.bind("proles", proles)
g.bind("prov", prov)
g.bind("tvc", tvc)

## Extracting RDF statements about persons

For each person in the TEI document, we:
1. Extract the person's xml:id
2. Build a unique URI for the person by concatenating the 'base_uri' from above with the person's xml:id. In order to make clear what kind of resource the URI represents, we also add the directory '/person/' before the actual person's xml:id.
3. Add our first triple to the RDF graph: the person's URI is the subject, followed by the predicate rdf:type, and the class schema:Person. By doing so, we assign the individual person to the ontological class schema:Person (<https://schema.org/Person>).

In [30]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    g.add( (person_uri, RDF.type, schema.Person))

Now run the following print() functions to print out the triples just generated. RDFLib allows us to choose among different serialization formats, such as xml, n3, and nt.

In [13]:
print(g.serialize(format='xml'))
print(g.serialize(format='n3'))
print(g.serialize(format='nt'))

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <rdf:Description rdf:about="http://example.org/person/Socr">
    <rdf:type rdf:resource="https://schema.org/Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/person/Plat">
    <rdf:type rdf:resource="https://schema.org/Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/person/Aristot">
    <rdf:type rdf:resource="https://schema.org/Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/person/Xen">
    <rdf:type rdf:resource="https://schema.org/Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/person/Criti">
    <rdf:type rdf:resource="https://schema.org/Person"/>
  </rdf:Description>
</rdf:RDF>

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .


We now look for a sameAs attribute associated with the person. This attribute should contain one or more URIs pointing to authority records, such as VIAF, or to other resources about the same person, for example from DBpedia:
1. Using the get() function, we look for a sameAs attribute and split its contents by whitespace (line 4).
2. We loop through the list of URIs as many times as the total number of URIs stored in the sameAs attribute (lines 5-9), store the URIs in a variable 'same_as_uri' (line 7), then add a triple to the RDF graph at each loop (line 8): the person's URI is the subject, followed by the predicate owl:sameAs, and the retrieved sameAs URI. For example, if a sameAs attribute contains two URIs, two distinct RDF triples are added to the graph. 

In [None]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    same_as = person.get('sameAs', '').split()
    i = 0
    while i < len(same_as):
        same_as_uri = URIRef(same_as[i])
        g.add( (person_uri, OWL.sameAs, same_as_uri))
        i += 1

The next step is to provide each person's entity with a human-readable label, linked to the subject via the RDFS property 'rdfs:label'. In order to do so:
1. We iterate again through all persons looking for personal names, i.e. a child element <persName> (lines 1-4).
2. We then store the contents of such an element in the 'label' variable, as well as look for an xml:lang attribute associated with the label (lines 5-7).
3. If an xml:lang is found, the script adds a triple featuring the person whose rdfs:label is a Literal value to which a language declaration is attached (line 8).
4. Otherwise, the script creates a triple without declaring any specific language (line 10).

In [None]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    persname = person.find('./tei:persName', tei)
    label = persname.text 
    if persname.get('{http://www.w3.org/XML/1998/namespace}lang') is not None:
        label_lang = persname.get('{http://www.w3.org/XML/1998/namespace}lang')
        g.add( (person_uri, RDFS.label, Literal(label, lang=label_lang)))
    else:
        g.add( (person_uri, RDFS.label, Literal(label)))

In TEI, groups of related <person> elements (e.g. persons of the same type) are usually nested within a common <listPerson> element. The following script retrieves any @type or @corresp attributes on <listPerson>. These should contain, respectively, a natural language description of the persons' type and the URI of an authority record:
1. We look for a <listPerson> parent element (line 4).
2. We retrieve the attributes @type and/or @corresp (lines 5-6).
3. If a @type attribute was found, we add an RDF triple formed by the person's URI, the property dcterms:description and a Literal value containing a natural language description of the person's type (lines 7-8).
4. If a @corresp attribute was found, we add an RDF triple formed by the person's URI, the property dcterms:subject and a URI (ideally) of an authority record (lines 9-10).

In [None]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    listperson = person.getparent()
    perstype = listperson.get('type')
    perscorr = listperson.get('corresp')
    if perstype is not None:
        g.add( (person_uri, DCTERMS.description, Literal(perstype)))
    if perscorr is not None and perscorr.startswith('http'):
        g.add( (person_uri, DCTERMS.subject, URIRef(perscorr)))
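
lxml elements know their parent, so the enclosing <listPerson> can be reached with `getparent()`. A small sketch, assuming a hypothetical document with one typed and one untyped <listPerson>; the `corresp` URI is a placeholder:

```python
from lxml import etree

XML_ID = '{http://www.w3.org/XML/1998/namespace}id'
tei_ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

sample = b'''<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson type="philosopher" corresp="http://example.org/concept/philosopher">
    <person xml:id="Socr"/>
  </listPerson>
  <listPerson>
    <person xml:id="Xen"/>
  </listPerson>
</TEI>'''

group_info = {}
for person in etree.fromstring(sample).findall('.//tei:person', tei_ns):
    group = person.getparent()  # the enclosing <listPerson>
    # Missing attributes come back as None and can be skipped later
    group_info[person.get(XML_ID)] = (group.get('type'), group.get('corresp'))
```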

You may also be interested in extracting all references to a particular person in the text. The following script does precisely this:
1. It looks for any reference to the person, i.e. any <persName> element in the text whose @ref attribute corresponds to the @xml:id of the person (lines 3-4).
2. It retrieves the parent element of the <persName> and creates a unique URI for it (lines 5-7).
3. It adds an RDF statement which has the person as a subject, followed by the property dcterms:isReferencedBy, and the parent element's URI (line 8).
4. It adds two RDF statements describing the parent element's entity, which is a frbroo:F23_Expression_Fragment (cf. <http://iflastandards.info/ns/fr/frbr/frbroo/F23>) that is part of the TEI edition (lines 9-10).

In [17]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    ref = './tei:text//tei:persName[@ref="#' + person_id + '"]'
    for referenced_person in root.findall(ref, tei):
        parent = referenced_person.getparent()
        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')
        parent_uri = URIRef(base_uri + '/text/' + parent_id)
        g.add( (URIRef(base_uri + '/person/' + person_id), DCTERMS.isReferencedBy, parent_uri))  # URI of the current person, not of the last one from an earlier cell
        g.add( (parent_uri, RDF.type, frbroo.F23_Expression_Fragment))
        g.add( (parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))
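
The lookup of references can be tried out in isolation. The sketch below assumes a hypothetical text with two paragraphs; `referencing_ids()` is an illustrative helper that returns the xml:id of every element containing a mention, skipping parents that have no xml:id.

```python
from lxml import etree

XML_ID = '{http://www.w3.org/XML/1998/namespace}id'
tei_ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

sample = b'''<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p xml:id="p1"><persName ref="#Socr">Socrates</persName> taught
       <persName ref="#Plat">Plato</persName>.</p>
    <p xml:id="p2"><persName ref="#Socr">He</persName> wrote nothing.</p>
  </body></text>
</TEI>'''
sample_root = etree.fromstring(sample)

def referencing_ids(root, person_id):
    """Return the xml:id of each parent element that mentions the person."""
    path = './tei:text//tei:persName[@ref="#' + person_id + '"]'
    ids = []
    for mention in root.findall(path, tei_ns):
        parent_id = mention.getparent().get(XML_ID)
        if parent_id is not None:  # a parent without xml:id cannot get a URI
            ids.append(parent_id)
    return ids

socr_refs = referencing_ids(sample_root, 'Socr')
plat_refs = referencing_ids(sample_root, 'Plat')
```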

Our person's description is complete. Note that we could also write the code above by dividing it into smaller functions (e.g. def function_name()), then call each one of the functions at the end.  

In [9]:
def subject(person):
    g.add( (person_uri, RDF.type, schema.Person))
    
def sameas(person):
    # A missing sameAs attribute yields an empty list, so nothing is added
    for same_as in person.get('sameAs', '').split():
        g.add( (person_uri, OWL.sameAs, URIRef(same_as)))
        
def persname(person):
    persname = person.find('./tei:persName', tei)
    label = persname.text
    label_lang = persname.get('{http://www.w3.org/XML/1998/namespace}lang')
    if label_lang is not None:
        g.add( (person_uri, RDFS.label, Literal(label, lang=label_lang)))
    else:
        g.add( (person_uri, RDFS.label, Literal(label)))
        
def perstype(person):
    listperson = person.getparent()
    perstype = listperson.get('type')
    perscorr = listperson.get('corresp')
    if perstype is not None:
        g.add( (person_uri, DCTERMS.description, Literal(perstype)))
    if perscorr is not None and perscorr.startswith('http'):
        g.add( (person_uri, DCTERMS.subject, URIRef(perscorr)))
        
def referenced_person(person_id):
    ref = './tei:text//tei:persName[@ref="#' + person_id + '"]'
    for referenced_person in root.findall(ref, tei):
        parent = referenced_person.getparent()
        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')
        parent_uri = URIRef(base_uri + '/text/' + parent_id)
        g.add( (person_uri, DCTERMS.isReferencedBy, parent_uri))
        g.add( (parent_uri, RDF.type, frbroo.F23_Expression_Fragment))
        g.add( (parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))
        
# Calling all functions
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    person_ref = '#' + person_id
    subject(person)
    sameas(person)
    persname(person)
    referenced_person(person_id)
    perstype(person)

## Extracting RDF statements about events

In [15]:
def partic_event(person):
    for event in person.findall('./tei:event', tei):
        event_id = event.get('{http://www.w3.org/XML/1998/namespace}id')
        # Same URI pattern as rit_uri in role_in_event(), so that this
        # triple points at the resource typed as pro:RoleInTime
        partic_event_uri = URIRef(base_uri + '/rit/' + person_id + '-at-' + event_id)
        g.add( (person_uri, pro.holdsRoleInTime, partic_event_uri))

In [16]:
def role_in_event(person):
    for event in person.findall('./tei:event', tei):
        event_id = event.get('{http://www.w3.org/XML/1998/namespace}id')
        rit_uri = URIRef(base_uri + '/rit/' + person_id + '-at-' + event_id)
        pers_in_event = event.find('./tei:desc/tei:persName', tei)

        g.add( (rit_uri, RDF.type, pro.RoleInTime))

        # A specific role is only available if the <persName> in <desc>
        # points at this person and carries a @role attribute
        if pers_in_event is not None and pers_in_event.get('ref') == person_ref and pers_in_event.get('role') is not None:
            role_label = pers_in_event.get('role')
            role_uri = URIRef(base_uri + '/role/' + role_label)
            g.add( (role_uri, RDFS.label, Literal(role_label)))
            if pers_in_event.get('corresp') is not None:
                g.add( (role_uri, OWL.sameAs, URIRef(pers_in_event.get('corresp'))))
        else:
            # Otherwise fall back to a generic 'participant' role
            role_uri = URIRef(base_uri + '/role/participant')
            g.add( (role_uri, OWL.sameAs, URIRef('http://wordnet-rdf.princeton.edu/id/10421528-n')))
            g.add( (role_uri, RDFS.label, Literal('participant')))
        g.add( (rit_uri, pro.withRole, role_uri))
        g.add( (role_uri, RDF.type, pro.Role))

        g.add( (rit_uri, tvc.atTime, URIRef(base_uri + '/tvc/' + event_id + '-time')))
        g.add( (rit_uri, pro.relatesToEntity, URIRef(base_uri + '/event/' + event_id)))

        # If <desc> names several places, keep the one typed 'place_of_event'
        places = event.findall('./tei:desc/tei:placeName', tei)
        if len(places) > 1:
            places = [p for p in places if p.get('type') == 'place_of_event']
        if places and places[0].get('ref') is not None:
            g.add( (rit_uri, proles.relatesToPlace, URIRef(base_uri + '/place/' + places[0].get('ref').replace("#", ""))))

In [17]:
# The interval class and its date properties belong to the Time Interval
# ontology design pattern, not to OWL
ti = Namespace('http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#')

def event_time():
    g.add( (event_time_uri, RDF.type, ti.TimeInterval))
    if event.get('when') is not None:
        # A single date is both the start and the end of the interval
        g.add( (event_time_uri, ti.hasIntervalStartDate, Literal(event.get('when'), datatype=XSD.date)))
        g.add( (event_time_uri, ti.hasIntervalEndDate, Literal(event.get('when'), datatype=XSD.date)))
    if event.get('from') is not None:
        g.add( (event_time_uri, ti.hasIntervalStartDate, Literal(event.get('from'), datatype=XSD.date)))
    if event.get('to') is not None:
        g.add( (event_time_uri, ti.hasIntervalEndDate, Literal(event.get('to'), datatype=XSD.date)))

In [18]:
def event_desc():
    g.add( (event_uri, RDF.type, crm.E5_Event))
    g.add( (event_uri, RDF.type, schema.Event))
    if event.find('./tei:label', tei) is not None:
        label = event.find('./tei:label', tei).text
        g.add( (event_uri, RDFS.label, Literal(label)))
    if evtype is not None:
        g.add( (event_uri, DCTERMS.description, Literal(evtype)))
    if evcorr is not None and evcorr.startswith('http'):
        g.add( (event_uri, DCTERMS.subject, URIRef(evcorr)))

In [19]:
def event_source():
    # Only the <bibl> of the current event is described
    source = event.find('./tei:bibl', tei)
    if source is None:
        return
    source_id = source.get('{http://www.w3.org/XML/1998/namespace}id')
    source_uri = URIRef(base_uri + '/source/' + source_id)
    g.add( (event_uri, prov.hadPrimarySource, source_uri))
    g.add( (source_uri, RDF.type, prov.PrimarySource))
    author = source.find('./tei:author', tei)
    if author is not None and author.get('ref') is not None:
        # Keep what follows the '#', e.g. '#Plat' becomes 'Plat'
        author_id = author.get('ref').split('#')[-1]
        g.add( (source_uri, DCTERMS.creator, URIRef(base_uri + '/person/' + author_id)))
    title = source.find('./tei:title', tei)
    if title is not None:
        g.add( (source_uri, DCTERMS.title, Literal(title.text)))
    same_as = source.get('sameAs')
    if same_as is not None and same_as.startswith('http'):
        g.add( (source_uri, OWL.sameAs, URIRef(same_as)))
    evdate = source.find('./tei:date', tei)
    if evdate is not None and evdate.get('when') is not None:
        g.add( (source_uri, DCTERMS.date, Literal(evdate.get('when'), datatype=XSD.date)))

Call functions

In [20]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    person_ref = '#' + person_id
    partic_event(person)
    role_in_event(person)

In [21]:
for event in root.findall('.//tei:event', tei):
    event_id = event.get('{http://www.w3.org/XML/1998/namespace}id')
    event_time_uri = URIRef(base_uri + '/tvc/' + event_id + '-time')  # same URI as the tvc:atTime object in role_in_event()
    event_uri = URIRef(base_uri + '/event/' + event_id)
    evcorr = event.get('corresp')
    evtype = event.get('type')
    event_time()
    event_desc()
    event_source()

## Extracting RDF statements about relations

The aim of the following script is to extract information about the relationships in which a person participates. In TEI, relationships are normally encoded using the element <relation>, grouped in a <listRelation> that is nested within the <listPerson> element. There are two main types of relationships: active/passive (unilateral relationship, e.g. Person A (active) is mother of Person B (passive)) and mutual (mutual relationship, e.g. Person A/B is colleague of Person B/A).
1. For each person, the script iterates through all <relation> elements (lines 1-2).
2. If an @active attribute containing a reference to the person is found on <relation> (line 3), the script iterates through all possible values of the @passive attribute adding an RDF triple for each of them (lines 4-8). The @name attribute on <relation> should provide a term from a vocabulary such as Agrelon (cf. line 7).
3. The same is done for mutual relationships: if the person is listed in the @mutual attribute, a triple is added for every other person listed there (lines 9-16).

In [22]:
def relation(person):
    for relation in root.findall('.//tei:listRelation/tei:relation', tei):
        if relation.get('active') is not None and relation.get('active') == person_ref:
            passive = relation.get('passive').replace("#", "").split()
            i = 0
            while i < len(passive):
                g.add( (person_uri, agrelon[relation.get('name')], URIRef(base_uri + '/person/' + passive[i])))
                i += 1
        elif relation.get('mutual') is not None:
            relentity = relation.get('mutual').split()
            if person_ref in relentity:
                mutual = [ref.replace("#", "") for ref in relentity if ref != person_ref]
                i = 0
                while i < len(mutual):
                    g.add( (person_uri, agrelon[relation.get('name')], URIRef(base_uri + '/person/' + mutual[i])))
                    i += 1

In [23]:
for person in root.findall('.//tei:person', tei):
    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')
    person_uri = URIRef(base_uri + '/person/' + person_id)
    person_ref = '#' + person_id
    relation(person)
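
The two relationship types can be explored separately from the graph. The sketch below assumes a hypothetical <listRelation>; the `name` values are placeholders rather than verified Agrelon terms, and `related_ids()` is an illustrative helper that returns the ids a given person points to through one <relation>.

```python
from lxml import etree

tei_ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Hypothetical relations: one directed, one mutual
sample = b'''<listRelation xmlns="http://www.tei-c.org/ns/1.0">
  <relation name="hasTeacher" active="#Plat" passive="#Socr"/>
  <relation name="hasFriend" mutual="#Socr #Criti #Xen"/>
</listRelation>'''
relations = etree.fromstring(sample).findall('./tei:relation', tei_ns)

def related_ids(relation, person_ref):
    """Return the ids the person is related to through this <relation>."""
    if relation.get('active') == person_ref:
        # Directed relation: every @passive value is a target
        targets = relation.get('passive', '').split()
    elif person_ref in relation.get('mutual', '').split():
        # Mutual relation: every other participant is a target
        targets = [ref for ref in relation.get('mutual').split() if ref != person_ref]
    else:
        targets = []
    return [ref.lstrip('#') for ref in targets]

plato_teachers = related_ids(relations[0], '#Plat')
socrates_friends = related_ids(relations[1], '#Socr')
```

Comparing whole references instead of replacing substrings avoids mangling ids that happen to contain another person's id.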

## Extracting RDF statements about places

In [24]:
def place_subject(place):
    g.add( (place_uri, RDF.type, schema.Place))

In [25]:
def place_sameas(place):
    # A missing sameAs attribute yields an empty list, so nothing is added
    for same_as in place.get('sameAs', '').split():
        g.add( (place_uri, OWL.sameAs, URIRef(same_as)))

In [26]:
def placename(place):
    placename = place.find('./tei:placeName', tei)
    label = placename.text
    label_lang = placename.get('{http://www.w3.org/XML/1998/namespace}lang')
    if label_lang is not None:
        g.add( (place_uri, RDFS.label, Literal(label, lang=label_lang)))
    else:
        g.add( (place_uri, RDFS.label, Literal(label)))

In [27]:
def referenced_place(place_id):
    ref = './/tei:placeName[@ref="#' + place_id + '"]'
    for referenced_place in root.findall(ref, tei):
        parent = referenced_place.getparent()
        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')
        parent_uri = URIRef(base_uri + '/text/' + parent_id)
        g.add( (place_uri, DCTERMS.isReferencedBy, parent_uri))
        g.add( (parent_uri, RDF.type, frbroo.F23_Expression_Fragment))
        g.add( (parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))

Call functions

In [28]:
for place in root.findall('.//tei:place', tei):
    place_id = place.get('{http://www.w3.org/XML/1998/namespace}id')
    place_uri = URIRef(base_uri + '/place/' + place_id)
    place_ref = '#' + place_id
    place_subject(place)
    place_sameas(place)
    placename(place)
    referenced_place(place_id)

In [19]:
g.serialize(destination="output.xml", format='xml')