## XML and CMDI introduction

In this notebook you will:
- get familiar with the [XML](https://en.wikipedia.org/wiki/XML) file structure
- get familiar with the [CMDI](https://www.clarin.eu/content/component-metadata) metadata format

### XML basics
XML (Extensible Markup Language) is a markup language for describing structured data; it does not do anything on its own. Its purpose is to improve the machine readability of data.
Below you can see an example XML file that we will use in the following cells to demonstrate how to access field values.

In [1]:
!cat "./example_xml.xml"

<studentsList>
	<student id="1">
		<firstName>John</firstName>
		<lastName>Snow</lastName>
		<scores>
			<homework1>80</homework1>
			<homework2>70</homework2>
			<project>85</project>
		</scores>
	</student>
	<student id="2">
		<firstName>Jozef</firstName>
		<lastName>Szwejk</lastName>
		<scores>
			<homework1>70</homework1>
			<homework2>65</homework2>
			<project>35</project>
		</scores>
	</student>
        <student id="3">
                <firstName>Franek</firstName>
                <lastName>Dolas</lastName>
                <scores>
                        <homework1>65</homework1>
                        <homework2>80</homework2>
                        <project>45</project>
                </scores>
        </student>
</studentsList>
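
To make the building blocks of such a file concrete, here is a minimal sketch that parses a shortened, inline copy of the student list with Python's standard-library `xml.etree.ElementTree` (chosen here only because it needs no installation). The inline string is a hypothetical subset of the file above, not the file itself.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the student list (hypothetical subset of the file above)
xml_text = (
    "<studentsList>"
    '<student id="1">'
    "<firstName>John</firstName>"
    "<lastName>Snow</lastName>"
    "</student>"
    "</studentsList>"
)

root = ET.fromstring(xml_text)
student = root[0]        # first child element of the root
first_name = student[0]  # first child element of the student

print(root.tag)                         # element name of the root
print(student.attrib)                   # attributes as a dict of strings
print(first_name.tag, first_name.text)  # child element name and its text content
```

An element therefore carries three kinds of information: its tag, its attributes (always strings), and its content, which is either text or further child elements.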



### XPath
[XPath](https://www.w3.org/TR/1999/REC-xpath-19991116/) is a query language for navigating through the nodes of an XML document. [lxml](https://lxml.de/) is a Python package that provides a data structure for storing the XML document tree and supports XPath queries.

Below you can see a few examples of how to access the nodes of an XML file in Python.

In [2]:
# Make sure lxml is installed
!pip install lxml



In [3]:
# Parse the XML file into lxml's element tree
from lxml import etree


def print_xml(tree, declaration: bool = False):
    print(etree.tostring(tree, encoding='UTF-8', xml_declaration=declaration, pretty_print=True).decode())

xml_tree = etree.parse("./example_xml.xml")

# Let's print
print(xml_tree)
print_xml(xml_tree)

<lxml.etree._ElementTree object at 0x7f1c382a5600>
<studentsList>
	<student id="1">
		<firstName>John</firstName>
		<lastName>Snow</lastName>
		<scores>
			<homework1>80</homework1>
			<homework2>70</homework2>
			<project>85</project>
		</scores>
	</student>
	<student id="2">
		<firstName>Jozef</firstName>
		<lastName>Szwejk</lastName>
		<scores>
			<homework1>70</homework1>
			<homework2>65</homework2>
			<project>35</project>
		</scores>
	</student>
        <student id="3">
                <firstName>Franek</firstName>
                <lastName>Dolas</lastName>
                <scores>
                        <homework1>65</homework1>
                        <homework2>80</homework2>
                        <project>45</project>
                </scores>
        </student>
</studentsList>



Below, we show a short example of how to use XPath with an _ElementTree_.

In [4]:
"""
    Select nodes and values
"""
students = xml_tree.xpath("//student")

# We obtain a list of elements
students

[<Element student at 0x7f1c3a2fd840>,
 <Element student at 0x7f1c382bacc0>,
 <Element student at 0x7f1c381f7380>]

In [5]:
# We can explore content of elements

for student in students:
    print(f"Tag: {student.tag}")
    print(f"Attributes: {student.attrib}")
    print(f"Tree:\n{etree.tostring(student, encoding='UTF-8', xml_declaration=True, pretty_print=True).decode()}")


Tag: student
Attributes: {'id': '1'}
Tree:
<?xml version='1.0' encoding='UTF-8'?>
<student id="1">
		<firstName>John</firstName>
		<lastName>Snow</lastName>
		<scores>
			<homework1>80</homework1>
			<homework2>70</homework2>
			<project>85</project>
		</scores>
	</student>
	

Tag: student
Attributes: {'id': '2'}
Tree:
<?xml version='1.0' encoding='UTF-8'?>
<student id="2">
		<firstName>Jozef</firstName>
		<lastName>Szwejk</lastName>
		<scores>
			<homework1>70</homework1>
			<homework2>65</homework2>
			<project>35</project>
		</scores>
	</student>
        

Tag: student
Attributes: {'id': '3'}
Tree:
<?xml version='1.0' encoding='UTF-8'?>
<student id="3">
                <firstName>Franek</firstName>
                <lastName>Dolas</lastName>
                <scores>
                        <homework1>65</homework1>
                        <homework2>80</homework2>
                        <project>45</project>
                </scores>
        </student>




In [6]:
for student in students:
    # Extract attribute value
    student_id = student.get("id")
    # find textual values of children nodes
    first_name = student.xpath("//firstName/text()")
    last_name = student.xpath("//lastName/text()")
    print(f"ID: {student_id}\tName: {' '.join(first_name)} {' '.join(last_name)}")

ID: 1	Name: John Jozef Franek Snow Szwejk Dolas
ID: 2	Name: John Jozef Franek Snow Szwejk Dolas
ID: 3	Name: John Jozef Franek Snow Szwejk Dolas


Note that in the example above the XPath expressions `//firstName/text()` and `//lastName/text()` find all occurrences of the first and last names across all _student_ elements, because a leading `//` searches from the document root regardless of the element the query is called on. To select nodes relative to the current element, start the XPath with `./`.

In [7]:
for student in students:
    student_id = student.get("id")
    first_name = student.xpath("./firstName/text()")
    last_name = student.xpath("./lastName/text()")
    print(f"ID: {student_id}\tName: {' '.join(first_name)} {' '.join(last_name)}")

ID: 1	Name: John Snow
ID: 2	Name: Jozef Szwejk
ID: 3	Name: Franek Dolas
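
The difference between the three common path prefixes can be summarised in a small sketch. It uses a hypothetical two-student inline document (a shortened version of `example_xml.xml`) so that it runs on its own.

```python
from lxml import etree

# Hypothetical, shortened version of example_xml.xml
root = etree.fromstring(
    "<studentsList>"
    "<student id='1'><firstName>John</firstName>"
    "<scores><project>85</project></scores></student>"
    "<student id='2'><firstName>Jozef</firstName>"
    "<scores><project>35</project></scores></student>"
    "</studentsList>"
)
second = root.xpath("./student[@id='2']")[0]

# `//` starts at the document root, so the context element is ignored
from_root = second.xpath("//project/text()")
# `./` only looks at direct children; `project` is nested deeper, so nothing matches
children_only = second.xpath("./project/text()")
# `.//` searches all descendants of the context element
descendants = second.xpath(".//project/text()")

print(from_root, children_only, descendants)
# ['85', '35'] [] ['35']
```

In short, `./` addresses direct children, `.//` addresses any descendant of the current element, and `//` addresses any element in the whole document.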


We can filter nodes by the values of their children and attributes using predicates, written in square brackets `[]`.

In [8]:
"""
    XPath and conditioning on attributes and values
"""

# Students pass only if they scored at least 40 points on both homeworks and the project
students_that_passed = xml_tree.xpath("//student[./scores/homework1 >= 40 and ./scores/homework2 >= 40 and ./scores/project >= 40]")
for student in students_that_passed:
    print_xml(student)

<student id="1">
		<firstName>John</firstName>
		<lastName>Snow</lastName>
		<scores>
			<homework1>80</homework1>
			<homework2>70</homework2>
			<project>85</project>
		</scores>
	</student>
	

<student id="3">
                <firstName>Franek</firstName>
                <lastName>Dolas</lastName>
                <scores>
                        <homework1>65</homework1>
                        <homework2>80</homework2>
                        <project>45</project>
                </scores>
        </student>




In [9]:
# Select all students with id > 1. Although `id` is stored as a string, XPath
# converts attribute and node values to numbers for numeric comparisons
students_with_required_id = xml_tree.xpath("./student[@id > 1]")
for student in students_with_required_id:
    print_xml(student)

<student id="2">
		<firstName>Jozef</firstName>
		<lastName>Szwejk</lastName>
		<scores>
			<homework1>70</homework1>
			<homework2>65</homework2>
			<project>35</project>
		</scores>
	</student>
        

<student id="3">
                <firstName>Franek</firstName>
                <lastName>Dolas</lastName>
                <scores>
                        <homework1>65</homework1>
                        <homework2>80</homework2>
                        <project>45</project>
                </scores>
        </student>
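
The conversion of attribute values to numbers can be checked on a small inline document. The sketch below is illustrative only and uses hypothetical ids (`2` and `10`), chosen because they sort differently as numbers and as strings.

```python
from lxml import etree

# Hypothetical ids chosen so that numeric and textual ordering disagree
root = etree.fromstring(
    "<studentsList><student id='2'/><student id='10'/></studentsList>"
)

# Relational operators (<, <=, >, >=) convert the attribute value to a number
numeric = root.xpath("./student[@id > 9]/@id")
# Equality against a string literal compares the text as written
as_text = root.xpath("./student[@id = '10']/@id")

print(numeric)     # ['10']
print(as_text)     # ['10']
print("10" > "9")  # False: plain Python string comparison is lexicographic
```

Because XPath compares numerically, `10` is correctly treated as greater than `9`, whereas comparing the raw strings in Python would give the opposite answer.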




### Namespaces, schemas and CMDI

An XML document can use multiple schemas and namespaces, which helps when aggregating information from multiple sources or providing it in parallel formats. However, when namespaces are present, XPath queries require an explicit mapping from namespace prefixes to their Uniform Resource Identifiers (URIs).

Let's now look at a real-world example of the metadata we will explore further in this tutorial. Specifically, we will look at the [CMDI](https://www.clarin.eu/content/component-metadata) metadata description of the available OCR data from the historical Latvian newspaper _Valmieras Ziņojumu Lapa_.

In [10]:
# Let's explore 
xml_tree = etree.parse("./example_cmdi.xml")
print_xml(xml_tree, True)

<?xml version='1.0' encoding='UTF-8'?>
<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.2" xsi:schemaLocation="http://www.clarin.eu/cmd/1         http://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd         http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869         https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1639731773869/xsd">
  <cmd:Header>
    <cmd:MdCreator>aggregation_cmdi_creation.py</cmd:MdCreator>
    <cmd:MdCreationDate>2022-04-21</cmd:MdCreationDate>
    <cmd:MdSelfLink>https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200303/Valmieras_Zi_ojumu_Lapa_collection.xml</cmd:MdSelfLink>
    <cmd:MdProfile>clarin.eu:cr1:p_1639731773869</cmd:MdProfile>
    <cmd:MdCollectionDisplayName>Europeana newspapers full-text</cmd:MdCollectionDisplayName>
  </cmd:Header>
  <cmd:Resour

Let's focus on the XML declaration and the opening tag of the root element:
```xml
<?xml version='1.0' encoding='UTF-8'?>
<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.2" xsi:schemaLocation="http://www.clarin.eu/cmd/1         http://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd         http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869         https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1639731773869/xsd">
```

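An `xmlns:prefix` attribute declares a namespace and binds it to a prefix (here `cmd` and `xsi`). A plain `xmlns` attribute declares the default namespace, which applies to every element that has no prefix.

XPath 1.0 has no notion of a default namespace: an unprefixed name in a query only matches elements that are in no namespace. To query an ElementTree that uses namespaces, we therefore need to pass a prefix-to-URI mapping explicitly.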

In [11]:
print(f"That does not work: {xml_tree.xpath('//Language/code/text()')}")

# Bind the document's default namespace to a prefix of our choosing (here `dft`)
nsmap = {"cmd": "http://www.clarin.eu/cmd/1",
         "dft": "http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869"}

print(f"That does work: {xml_tree.xpath('//dft:Language/dft:code/text()', namespaces=nsmap)}")

That does not work: []
That does work: ['deu', 'lav', 'rus']
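
Why the choice of prefix is arbitrary becomes clearer when looking at how namespaced tags are stored. The following sketch uses the standard-library `xml.etree.ElementTree` (lxml follows the same ElementTree conventions for tag names) on a hypothetical miniature document; the `example.org` URIs are placeholders, not real CMDI namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: one prefixed namespace and one default namespace
xml_text = (
    '<cmd:CMD xmlns:cmd="http://example.org/envelope" '
    'xmlns="http://example.org/profile">'
    "<cmd:Header/>"
    "<Language><code>lav</code></Language>"
    "</cmd:CMD>"
)
root = ET.fromstring(xml_text)

# Tags are stored in Clark notation: {namespace URI}local-name
print(root.tag)
print([child.tag for child in root])

# Prefixes in a query are resolved through our own mapping, not the document's,
# so `env` works even though the document itself uses `cmd`
nsmap = {"env": "http://example.org/envelope", "dft": "http://example.org/profile"}
without_ns = root.find("./Language/code")
with_ns = root.find("./dft:Language/dft:code", nsmap)
header = root.find("./env:Header", nsmap)

print(without_ns)    # None: the unprefixed name is not in any namespace
print(with_ns.text)  # lav
```

Only the URI identifies a namespace. The prefix is a local shorthand, which is why an invented prefix such as `dft` can stand in for the document's default namespace.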


Let's explore the collection metadata. Remember that we have the following:
```xml
...
    <Description>
        <description>Full text content aggregated from Europeana. Title: "Valmieras Ziņojumu Lapa". Years: 1900, 1901, 1903, 1904, 1905, 1906.</description>
    </Description>
...
```

In [12]:
print(xml_tree.xpath('//dft:Description/dft:description/text()', namespaces=nsmap))

['Full text content aggregated from Europeana. Title: "Valmieras Ziņojumu Lapa". Years: 1900, 1901, 1903, 1904, 1905, 1906.']


In [13]:
resource_list = xml_tree.xpath('//cmd:ResourceProxyList', namespaces=nsmap)

In [14]:
resource_list = resource_list[0]
print_xml(resource_list, True)

<?xml version='1.0' encoding='UTF-8'?>
<cmd:ResourceProxyList xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <cmd:ResourceProxy id="landing_page">
        <cmd:ResourceType>LandingPage</cmd:ResourceType>
        <cmd:ResourceRef>https://pro.europeana.eu/page/iiif#download</cmd:ResourceRef>
      </cmd:ResourceProxy>
      <cmd:ResourceProxy id="archive_edm">
        <cmd:ResourceType mimetype="application/zip">Resource</cmd:ResourceType>
        <cmd:ResourceRef>ftp://download.europeana.eu/newspapers/fulltext/edm_issue/9200303.zip</cmd:ResourceRef>
      </cmd:ResourceProxy>
      <cmd:ResourceProxy id="archive_alto">
        <cmd:ResourceType mimetype="application/zip">Resource</cmd:ResourceType>
        <cmd:ResourceRef>ftp://download.europeana.eu/newspapers/fulltext/alto/9200303.zip</cmd:ResourceRef>
      </cmd:ResourceProxy>
      <cmd:ResourceProxy id="_1

We can access metadata about resources by year of issue.

In [15]:
resource_id = resource_list.xpath("./cmd:ResourceProxy[./cmd:ResourceType/text() = 'Metadata']/@id", namespaces=nsmap)
print(resource_id)

['_1900', '_1901', '_1903', '_1904', '_1905', '_1906']


In [16]:
_id = resource_id[0]
print(f"Ref ID: {_id}")
# Look up the ResourceRef of the proxy whose id matches the selected year
ref_1900 = resource_list.xpath(f"./cmd:ResourceProxy[@id = '{_id}']/cmd:ResourceRef/text()", namespaces=nsmap)
print(f"Ref link: {ref_1900}")
ref_1900 = ref_1900[0]

Ref ID: _1900
Ref link: ['https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200303/Valmieras_Zi_ojumu_Lapa_1900.xml']


We can resolve referenced metadata in order to learn how the metadata links to the underlying bibliographic resources. 

In [17]:
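import requests

# Fail early on network problems or HTTP errors instead of parsing an error page
response = requests.get(ref_1900, timeout=30)
response.raise_for_status()
xml_tree_1900 = etree.fromstring(response.content)
print_xml(xml_tree_1900)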

<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.2" xsi:schemaLocation="http://www.clarin.eu/cmd/1     http://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd     http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997     https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1633000337997/xsd">
  <cmd:Header>
    <cmd:MdCreator>aggregation_cmdi_creation.py</cmd:MdCreator>
    <cmd:MdCreationDate>2022-04-21</cmd:MdCreationDate>
    <cmd:MdSelfLink>https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200303/Valmieras_Zi_ojumu_Lapa_1900.xml</cmd:MdSelfLink>
    <cmd:MdProfile>clarin.eu:cr1:p_1633000337997</cmd:MdProfile>
    <cmd:MdCollectionDisplayName>Europeana newspapers full-text</cmd:MdCollectionDisplayName>
  </cmd:Header>
  <cmd:Resources>
    <cmd:ResourceProxyList>
      <cmd:ResourceProxy

Let's investigate the `Subresource` elements of the XML tree.

In [18]:
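# The yearly record uses a different profile, so the default namespace URI changes
nsmap["dft"] = "http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997"
subresources = xml_tree_1900.xpath("//dft:Subresource", namespaces=nsmap)

# Iterating over an element yields its children; print those of the fourth Subresource
for child in subresources[3]:
    print_xml(child)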

<SubresourceDescription xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997" xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
          <label>Valmieras Ziņojumu Lapa - 1900-01-15</label>
          <IdentificationInfo>
            <identifier>http://data.theeuropeanlibrary.org/BibliographicResource/3000059947291</identifier>
            <identifier>3000059947291</identifier>
          </IdentificationInfo>
          <Language>
            <name>German</name>
            <code>deu</code>
          </Language>
          <Language>
            <name>Latvian</name>
            <code>lav</code>
          </Language>
          <Language>
            <name>Russian</name>
            <code>rus</code>
          </Language>
          <TemporalCoverage>
            <label>1900-01-15</label>
            <Start>
              <date>1900-01-15</date>
            </Start>
            <End>
              <date>1900-01-15</date>
        

Let's select all issues from January 1900 and retrieve their download links.

In [19]:
import regex as re

In [20]:
january_subresources = [subresource for subresource in subresources 
                          if subresource.xpath("./dft:SubresourceDescription/dft:TemporalCoverage/dft:label[contains(text(), '-01-')]", namespaces=nsmap)]

# Each Subresource points to its ResourceProxy through the namespaced `cmd:ref` attribute
cmd_refs = [subresource.xpath("./@cmd:ref", namespaces=nsmap)[0] for subresource in january_subresources]

direct_data_links = [xml_tree_1900.xpath(f"//cmd:ResourceProxy[@id='{cmd_ref}']/cmd:ResourceRef/text()", namespaces=nsmap)[0] for cmd_ref in cmd_refs]
print(direct_data_links)

['https://www.europeana.eu/item/9200303/BibliographicResource_3000059947291', 'https://www.europeana.eu/item/9200303/BibliographicResource_3000059947285']
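
The linkage used above (a `Subresource` refers to a `ResourceProxy` by id, and the proxy holds the actual link) can also be resolved with a lookup table built once, rather than one XPath query per reference. The sketch below is one possible way to do this on a hypothetical miniature record; the namespace URIs, ids and links are placeholders and do not come from the real metadata.

```python
from lxml import etree

# Hypothetical miniature record; URIs, ids and links are placeholders
CMD_NS = "http://example.org/envelope"
DFT_NS = "http://example.org/profile"
xml_text = (
    f'<cmd:CMD xmlns:cmd="{CMD_NS}" xmlns="{DFT_NS}">'
    '<cmd:ResourceProxy id="r1">'
    "<cmd:ResourceRef>https://example.org/issue-1</cmd:ResourceRef>"
    "</cmd:ResourceProxy>"
    '<cmd:ResourceProxy id="r2">'
    "<cmd:ResourceRef>https://example.org/issue-2</cmd:ResourceRef>"
    "</cmd:ResourceProxy>"
    '<Subresource cmd:ref="r2"><label>1900-01-15</label></Subresource>'
    '<Subresource cmd:ref="r1"><label>1900-02-01</label></Subresource>'
    "</cmd:CMD>"
)
ns = {"cmd": CMD_NS, "dft": DFT_NS}
root = etree.fromstring(xml_text)

# Build the id -> link table once
links = {
    proxy.get("id"): proxy.findtext("cmd:ResourceRef", namespaces=ns)
    for proxy in root.xpath("//cmd:ResourceProxy", namespaces=ns)
}

# Keep the subresources whose label contains the month part `-01-`
january = root.xpath("//dft:Subresource[contains(dft:label, '-01-')]", namespaces=ns)

# Outside XPath, namespaced attributes are addressed in Clark notation: {URI}name
january_links = [links[sub.get(f"{{{CMD_NS}}}ref")] for sub in january]
print(january_links)
# ['https://example.org/issue-2']
```

For a handful of references both approaches behave the same; a dictionary mainly pays off when many subresources have to be resolved against a long proxy list.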


More concrete examples of data retrieval will follow in Chapter 2. Hopefully you now have a grasp of how the metadata links to the textual resources.