## XML and CMDI introduction

In this notebook you will:
- get familiar with the [XML](https://en.wikipedia.org/wiki/XML) file structure
- get familiar with the [CMDI](https://www.clarin.eu/content/component-metadata) metadata format

### XML basics
XML (Extensible Markup Language) is a markup language that does not do anything on its own. Its purpose is to improve the machine readability of data.
Below you can see an example XML file, which we will use in the following cells to demonstrate how to access field values.

In [1]:
!cat "./example_xml.xml"

<studentsList>
	<student id="1">
		<firstName>John</firstName>
		<lastName>Snow</lastName>
		<scores>
			<homework1>80</homework1>
			<homework2>70</homework2>
			<project>85</project>
		</scores>
	</student>
	<student id="2">
		<firstName>Jozef</firstName>
		<lastName>Szwejk</lastName>
		<scores>
			<homework1>70</homework1>
			<homework2>65</homework2>
			<project>35</project>
		</scores>
	</student>
        <student id="3">
                <firstName>Franek</firstName>
                <lastName>Dolas</lastName>
                <scores>
                        <homework1>65</homework1>
                        <homework2>80</homework2>
                        <project>45</project>
                </scores>
        </student>
</studentsList>
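
As a rough illustration of the building blocks in the file above (elements, attributes and text), the following sketch parses a shortened inline copy of the student list with Python's standard-library `xml.etree.ElementTree`. The inline string is an assumption made for this example, not the notebook's file. The sketch also shows that a parser rejects a document whose tags are not properly nested.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the student list (sample data assumed for this sketch)
xml_text = """
<studentsList>
    <student id="1">
        <firstName>John</firstName>
        <lastName>Snow</lastName>
    </student>
</studentsList>
"""

root = ET.fromstring(xml_text)
student = root[0]           # first child element of the root

print(root.tag)             # element name
print(student.attrib)       # attributes as a dictionary
print(student[0].text)      # text content of <firstName>

# Tags that are closed in the wrong order make the document ill-formed
try:
    ET.fromstring("<student><firstName>John</student></firstName>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(f"Well-formed: {well_formed}")
```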



### XPath
[XPath](https://www.w3.org/TR/1999/REC-xpath-19991116/) is a query language for navigating the nodes of an XML document. [lxml](https://lxml.de/) is a Python package that provides a structure for storing the XML document tree and supports XPath.

Below you can see a few examples of accessing the nodes of an XML file in Python.

In [2]:
# Make sure lxml is installed
!pip install lxml



In [3]:
# Parse xml file into lxml's element tree
from lxml import etree


def print_xml(tree, declaration: bool = False):
    print(etree.tostring(tree, encoding='UTF-8', xml_declaration=declaration, pretty_print=True).decode())

xml_tree = etree.parse("./example_xml.xml")

# Let's print
print(xml_tree)
print_xml(xml_tree)

<lxml.etree._ElementTree object at 0x7f460444bcc0>
<studentsList>
	<student id="1">
		<firstName>John</firstName>
		<lastName>Snow</lastName>
		<scores>
			<homework1>80</homework1>
			<homework2>70</homework2>
			<project>85</project>
		</scores>
	</student>
	<student id="2">
		<firstName>Jozef</firstName>
		<lastName>Szwejk</lastName>
		<scores>
			<homework1>70</homework1>
			<homework2>65</homework2>
			<project>35</project>
		</scores>
	</student>
        <student id="3">
                <firstName>Franek</firstName>
                <lastName>Dolas</lastName>
                <scores>
                        <homework1>65</homework1>
                        <homework2>80</homework2>
                        <project>45</project>
                </scores>
        </student>
</studentsList>



Below is a short example of XPath usage with lxml's ElementTree.

In [4]:
"""
    Select nodes and values
"""
students = xml_tree.xpath("//student")

# We obtain a list of elements
students

[<Element student at 0x7f46043c6140>,
 <Element student at 0x7f46040c3280>,
 <Element student at 0x7f46040c32c0>]
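
The result type of `xpath()` depends on the expression, not only on the document. The sketch below, which uses a small inline document as an assumption instead of `example_xml.xml`, shows the most common cases: element lists, text and attribute lists, and the scalar results of XPath functions.

```python
from lxml import etree

# Inline sample data (an assumption for this sketch, not the notebook's file)
root = etree.fromstring(
    '<studentsList>'
    '<student id="1"><firstName>John</firstName></student>'
    '<student id="2"><firstName>Jozef</firstName></student>'
    '</studentsList>'
)

elements = root.xpath("//student")            # list of Element objects
texts = root.xpath("//firstName/text()")      # list of strings
ids = root.xpath("//student/@id")             # list of attribute values (strings)
count = root.xpath("count(//student)")        # XPath number, returned as a float
has_john = root.xpath("boolean(//firstName[text() = 'John'])")  # Python bool

print(len(elements), texts, ids, count, has_john)
```

Because node selections always come back as lists, an expression that matches nothing returns `[]` rather than raising an error, so it is worth checking the length before indexing.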

In [5]:
# We can explore content of elements

for student in students:
    print(f"Tag: {student.tag}")
    print(f"Attributes: {student.attrib}")
    print(f"Tree:\n{etree.tostring(student, encoding='UTF-8', xml_declaration=True, pretty_print=True).decode()}")


Tag: student
Attributes: {'id': '1'}
Tree:
<?xml version='1.0' encoding='UTF-8'?>
<student id="1">
		<firstName>John</firstName>
		<lastName>Snow</lastName>
		<scores>
			<homework1>80</homework1>
			<homework2>70</homework2>
			<project>85</project>
		</scores>
	</student>
	

Tag: student
Attributes: {'id': '2'}
Tree:
<?xml version='1.0' encoding='UTF-8'?>
<student id="2">
		<firstName>Jozef</firstName>
		<lastName>Szwejk</lastName>
		<scores>
			<homework1>70</homework1>
			<homework2>65</homework2>
			<project>35</project>
		</scores>
	</student>
        

Tag: student
Attributes: {'id': '3'}
Tree:
<?xml version='1.0' encoding='UTF-8'?>
<student id="3">
                <firstName>Franek</firstName>
                <lastName>Dolas</lastName>
                <scores>
                        <homework1>65</homework1>
                        <homework2>80</homework2>
                        <project>45</project>
                </scores>
        </student>




In [6]:
for student in students:
    # Extract attribute value
    student_id = student.get("id")
    # find textual values of children nodes
    first_name = student.xpath("//firstName/text()")
    last_name = student.xpath("//lastName/text()")
    print(f"ID: {student_id}\tName: {' '.join(first_name)} {' '.join(last_name)}")

ID: 1	Name: John Jozef Franek Snow Szwejk Dolas
ID: 2	Name: John Jozef Franek Snow Szwejk Dolas
ID: 3	Name: John Jozef Franek Snow Szwejk Dolas


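Note that in the example above the XPath calls `//firstName/text()` and `//lastName/text()` find all occurrences of first and last names in all `student` elements, because a path starting with `//` is evaluated from the document root even when it is called on a single element. To select nodes relative to the element, start the XPath with `./`.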

In [7]:
for student in students:
    student_id = student.get("id")
    first_name = student.xpath("./firstName/text()")
    last_name = student.xpath("./lastName/text()")
    print(f"ID: {student_id}\tName: {' '.join(first_name)} {' '.join(last_name)}")

ID: 1	Name: John Snow
ID: 2	Name: Jozef Szwejk
ID: 3	Name: Franek Dolas
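
The three ways of starting a path can be compared side by side. This is a minimal sketch on an inline document (assumed for the example): `//` searches the whole document, `.//` searches all descendants of the current element, and `./` looks at direct children only.

```python
from lxml import etree

# Inline sample data (an assumption for this sketch)
root = etree.fromstring(
    '<studentsList>'
    '<student id="1"><scores><project>85</project></scores></student>'
    '<student id="2"><scores><project>35</project></scores></student>'
    '</studentsList>'
)

second = root.xpath("//student")[1]

absolute = second.xpath("//project/text()")     # whole document
descendants = second.xpath(".//project/text()") # anywhere below this student
children = second.xpath("./project/text()")     # direct children of this student

print(absolute)
print(descendants)
print(children)
```

Here `children` is empty because `project` is nested inside `scores` and is therefore not a direct child of `student`.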


We can filter nodes by the values of their children and attributes using predicates, written in square brackets `[]`.

In [8]:
"""
    XPath and conditioning on attributes and values
"""

# A student passes when every score is at least 40 points
students_that_passed = xml_tree.xpath("//student[./scores/homework1 >= 40 and ./scores/homework2 >= 40 and ./scores/project >= 40]")
for student in students_that_passed:
    print_xml(student)

<student id="1">
		<firstName>John</firstName>
		<lastName>Snow</lastName>
		<scores>
			<homework1>80</homework1>
			<homework2>70</homework2>
			<project>85</project>
		</scores>
	</student>
	

<student id="3">
                <firstName>Franek</firstName>
                <lastName>Dolas</lastName>
                <scores>
                        <homework1>65</homework1>
                        <homework2>80</homework2>
                        <project>45</project>
                </scores>
        </student>
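
Predicates can also be nested, negated and combined with XPath functions. The following sketch works on an inline copy of the three students (an assumption, so that it runs without the file on disk). It uses the wildcard `*` to test every score at once and `sum()` to total the points per student.

```python
from lxml import etree

# Inline copy of the student scores (assumed for this sketch)
root = etree.fromstring("""<studentsList>
  <student id="1"><scores><homework1>80</homework1><homework2>70</homework2><project>85</project></scores></student>
  <student id="2"><scores><homework1>70</homework1><homework2>65</homework2><project>35</project></scores></student>
  <student id="3"><scores><homework1>65</homework1><homework2>80</homework2><project>45</project></scores></student>
</studentsList>""")

# Students with at least one score below 40
failed_ids = root.xpath("//student[scores/*[. < 40]]/@id")

# Students with no score below 40
passed_ids = root.xpath("//student[not(scores/*[. < 40])]/@id")

# Total points per student, computed by XPath's sum() relative to each element
totals = {s.get("id"): s.xpath("sum(scores/*)") for s in root.xpath("//student")}

print(failed_ids, passed_ids, totals)
```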




In [9]:
# Select all students with id > 1. The id attribute is a string,
# but XPath converts it to a number for the comparison
students_with_required_id = xml_tree.xpath("./student[@id > 1]")
for student in students_with_required_id:
    print_xml(student)

<student id="2">
		<firstName>Jozef</firstName>
		<lastName>Szwejk</lastName>
		<scores>
			<homework1>70</homework1>
			<homework2>65</homework2>
			<project>35</project>
		</scores>
	</student>
        

<student id="3">
                <firstName>Franek</firstName>
                <lastName>Dolas</lastName>
                <scores>
                        <homework1>65</homework1>
                        <homework2>80</homework2>
                        <project>45</project>
                </scores>
        </student>
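
The implicit conversion follows the XPath 1.0 comparison rules, and its effect depends on the operator and on the type of the literal. A small sketch with an inline document (sample data assumed for the example, including the zero-padded id `02`) illustrates the difference between numeric and string comparison.

```python
from lxml import etree

# Inline sample data; the zero-padded id is chosen to expose the conversion rules
root = etree.fromstring(
    '<studentsList>'
    '<student id="1"><lastName>Snow</lastName></student>'
    '<student id="02"><lastName>Szwejk</lastName></student>'
    '</studentsList>'
)

# Relational operators (<, <=, >, >=) always compare numbers
greater = root.xpath("//student[@id > 1]/lastName/text()")

# "=" against a number converts the attribute to a number: "02" becomes 2
equal_number = root.xpath("//student[@id = 2]/lastName/text()")

# "=" against a string compares the raw text: "02" is not "2"
equal_string = root.xpath("//student[@id = '2']/lastName/text()")

# Names are not numbers, so a relational comparison between them never matches
alphabetical = root.xpath("//student[lastName > 'M']/lastName/text()")

print(greater, equal_number, equal_string, alphabetical)
```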




### Namespaces, schemas and CMDI

An XML document can use multiple schemas and namespaces at once, which helps with aggregating information from multiple sources or providing it in parallel formats. However, when namespaces are present, XPath requires an explicit mapping of prefixes to namespace Uniform Resource Identifiers (URIs).

Let's look at a real-world example of the metadata we will come across later in this tutorial: the [CMDI](https://www.clarin.eu/content/component-metadata) metadata description of the available OCR data from the Latvian newspaper Valmieras Ziņojumu Lapa.

In [80]:
# Let's explore 
xml_tree = etree.parse("./example_cmdi.xml")
print_xml(xml_tree, True)

<?xml version='1.0' encoding='UTF-8'?>
<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.2" xsi:schemaLocation="http://www.clarin.eu/cmd/1         http://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd         http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869         https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1639731773869/xsd">
  <cmd:Header>
    <cmd:MdCreator>aggregation_cmdi_creation.py</cmd:MdCreator>
    <cmd:MdCreationDate>2022-04-21</cmd:MdCreationDate>
    <cmd:MdSelfLink>https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200303/Valmieras_Zi_ojumu_Lapa_collection.xml</cmd:MdSelfLink>
    <cmd:MdProfile>clarin.eu:cr1:p_1639731773869</cmd:MdProfile>
    <cmd:MdCollectionDisplayName>Europeana newspapers full-text</cmd:MdCollectionDisplayName>
  </cmd:Header>
  <cmd:Resour

Let's focus on the XML declaration and the root element, which carries the namespace declarations:
```xml
<?xml version='1.0' encoding='UTF-8'?>
<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.2" xsi:schemaLocation="http://www.clarin.eu/cmd/1         http://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd         http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869         https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1639731773869/xsd">
```

`xmlns:prefix` declares a namespace and binds it to a prefix (here `cmd` and `xsi`). A plain `xmlns` attribute declares the default namespace, which applies to all elements written without a prefix.
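
lxml exposes these declarations on every element. The sketch below uses a tiny inline document; the profile URI `http://example.org/profile` is a placeholder assumed for the example, not a real CMDI profile. It shows that the default namespace is stored under the key `None` and that element tags are kept internally in the `{uri}name` form.

```python
from lxml import etree

# Inline document; the default namespace URI is a hypothetical placeholder
root = etree.fromstring(
    '<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" '
    'xmlns="http://example.org/profile">'
    '<cmd:Header/><Language><code>lav</code></Language>'
    '</cmd:CMD>'
)

print(root.nsmap)    # prefix -> URI mapping, default namespace under None
print(root.prefix)   # prefix used by the root element
print(root.tag)      # fully qualified tag in {uri}name notation

language = root[1]   # the unprefixed <Language> element
print(language.tag)  # it belongs to the default namespace
print(etree.QName(language).localname)  # tag name without the namespace
```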

XPath has no notion of a default namespace: an unprefixed name in an XPath expression always refers to an element in no namespace. To query an ElementTree that uses namespaces, we need to pass a prefix-to-URI mapping explicitly.

In [186]:
print(f"That does not work: {xml_tree.xpath('//Language/code/text()')}")

# Bind the document's default namespace to a prefix of our choice (here "dft")
nsmap = {"cmd": "http://www.clarin.eu/cmd/1",
         "dft": "http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869"}

print(f"That does work: {xml_tree.xpath('//dft:Language/dft:code/text()', namespaces=nsmap)}")

That does not work: []
That does work: ['deu', 'lav', 'rus']
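
Instead of typing the URIs by hand, the mapping can be derived from the root element. This is a sketch under the assumption that all namespaces are declared on the root, as in the CMDI example; the inline document and its profile URI are placeholders. It also shows `local-name()`, an alternative that ignores namespaces altogether.

```python
from lxml import etree

# Inline document; the default namespace URI is a hypothetical placeholder
root = etree.fromstring(
    '<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" '
    'xmlns="http://example.org/profile">'
    '<Language><code>deu</code></Language>'
    '<Language><code>lav</code></Language>'
    '</cmd:CMD>'
)

# XPath needs a non-empty prefix, so rename the default namespace (key None) to "dft"
nsmap = {prefix or "dft": uri for prefix, uri in root.nsmap.items()}

with_prefix = root.xpath("//dft:Language/dft:code/text()", namespaces=nsmap)

# Namespace-agnostic alternative: match on the local part of the name only
by_local_name = root.xpath("//*[local-name() = 'code']/text()")

print(nsmap)
print(with_prefix, by_local_name)
```

The `local-name()` form is convenient for exploration, but it would also match a `code` element from any other namespace, so the explicit mapping is the safer choice when several profiles are mixed.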


Let's explore the collection metadata

In [187]:
print(xml_tree.xpath('//dft:Description/dft:description/text()', namespaces=nsmap))

['Full text content aggregated from Europeana. Title: "Valmieras Ziņojumu Lapa". Years: 1900, 1901, 1903, 1904, 1905, 1906.']


In [188]:
resource_list = xml_tree.xpath('//cmd:ResourceProxyList', namespaces=nsmap)

In [189]:
resource_list = resource_list[0]
print_xml(resource_list, True)

<?xml version='1.0' encoding='UTF-8'?>
<cmd:ResourceProxyList xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <cmd:ResourceProxy id="landing_page">
        <cmd:ResourceType>LandingPage</cmd:ResourceType>
        <cmd:ResourceRef>https://pro.europeana.eu/page/iiif#download</cmd:ResourceRef>
      </cmd:ResourceProxy>
      <cmd:ResourceProxy id="archive_edm">
        <cmd:ResourceType mimetype="application/zip">Resource</cmd:ResourceType>
        <cmd:ResourceRef>ftp://download.europeana.eu/newspapers/fulltext/edm_issue/9200303.zip</cmd:ResourceRef>
      </cmd:ResourceProxy>
      <cmd:ResourceProxy id="archive_alto">
        <cmd:ResourceType mimetype="application/zip">Resource</cmd:ResourceType>
        <cmd:ResourceRef>ftp://download.europeana.eu/newspapers/fulltext/alto/9200303.zip</cmd:ResourceRef>
      </cmd:ResourceProxy>
      <cmd:ResourceProxy id="_1

We can access metadata about resources by year of the issue.

In [190]:
resource_id = resource_list.xpath("./cmd:ResourceProxy[./cmd:ResourceType/text() = 'Metadata']/@id", namespaces=nsmap)
print(resource_id)

['_1900', '_1901', '_1903', '_1904', '_1905', '_1906']
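
For repeated lookups it can be handy to read the whole proxy list into a dictionary once. The sketch below assumes a reduced inline `ResourceProxyList` with placeholder `example.org` links; the element names follow the output shown above.

```python
from lxml import etree

CMD_NS = "http://www.clarin.eu/cmd/1"
nsmap = {"cmd": CMD_NS}

# Reduced inline proxy list with placeholder links (assumed for this sketch)
resource_list = etree.fromstring(f"""
<cmd:ResourceProxyList xmlns:cmd="{CMD_NS}">
  <cmd:ResourceProxy id="landing_page">
    <cmd:ResourceType>LandingPage</cmd:ResourceType>
    <cmd:ResourceRef>https://example.org/landing</cmd:ResourceRef>
  </cmd:ResourceProxy>
  <cmd:ResourceProxy id="_1900">
    <cmd:ResourceType>Metadata</cmd:ResourceType>
    <cmd:ResourceRef>https://example.org/1900.xml</cmd:ResourceRef>
  </cmd:ResourceProxy>
</cmd:ResourceProxyList>
""")

# Map every proxy id to its type and reference
proxies = {}
for proxy in resource_list.xpath("./cmd:ResourceProxy", namespaces=nsmap):
    proxies[proxy.get("id")] = {
        "type": proxy.findtext("cmd:ResourceType", namespaces=nsmap),
        "ref": proxy.findtext("cmd:ResourceRef", namespaces=nsmap),
    }

# Keep only the links that point to further metadata records
metadata_refs = {pid: p["ref"] for pid, p in proxies.items() if p["type"] == "Metadata"}
print(metadata_refs)
```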


In [191]:
_id = resource_id[0]
print(f"id: {_id}")
# Look up the reference stored in the resource proxy with the selected id
ref_1900 = resource_list.xpath(f"./cmd:ResourceProxy[@id = '{_id}']/cmd:ResourceRef/text()", namespaces=nsmap)
print(ref_1900)
ref_1900 = ref_1900[0]

id: _1900
['https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200303/Valmieras_Zi_ojumu_Lapa_1900.xml']


We can resolve referenced metadata in order to familiarise ourselves with how metadata links to the underlying bibliographic resources. 

In [192]:
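import requests

# Download the referenced CMDI record and parse it from the raw bytes
response = requests.get(ref_1900)
response.raise_for_status()  # stop early if the download failed
xml_tree_1900 = etree.fromstring(response.content)
print_xml(xml_tree_1900)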

<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.2" xsi:schemaLocation="http://www.clarin.eu/cmd/1     http://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd     http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997     https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1633000337997/xsd">
  <cmd:Header>
    <cmd:MdCreator>aggregation_cmdi_creation.py</cmd:MdCreator>
    <cmd:MdCreationDate>2022-04-21</cmd:MdCreationDate>
    <cmd:MdSelfLink>https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200303/Valmieras_Zi_ojumu_Lapa_1900.xml</cmd:MdSelfLink>
    <cmd:MdProfile>clarin.eu:cr1:p_1633000337997</cmd:MdProfile>
    <cmd:MdCollectionDisplayName>Europeana newspapers full-text</cmd:MdCollectionDisplayName>
  </cmd:Header>
  <cmd:Resources>
    <cmd:ResourceProxyList>
      <cmd:ResourceProxy
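
A downloaded response is not guaranteed to be a CMDI record; a server may return an HTML error page instead. The sketch below introduces a hypothetical helper, `parse_cmdi`, that parses raw bytes and checks the root element. It is one possible safeguard, not part of the notebook's pipeline.

```python
from lxml import etree

CMD_NS = "http://www.clarin.eu/cmd/1"


def parse_cmdi(content: bytes):
    """Parse raw bytes and verify that the root element is cmd:CMD (hypothetical helper)."""
    try:
        root = etree.fromstring(content)
    except etree.XMLSyntaxError as error:
        raise ValueError(f"Response is not well-formed XML: {error}") from error
    if root.tag != f"{{{CMD_NS}}}CMD":
        raise ValueError(f"Unexpected root element: {root.tag}")
    return root


# A minimal valid envelope is accepted
record = parse_cmdi(b'<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1"/>')
print(record.tag)

# A truncated HTML page and a well-formed non-CMDI document are both rejected
rejected = []
for payload in (b"<html><body>Not found", b"<html/>"):
    try:
        parse_cmdi(payload)
    except ValueError as error:
        rejected.append(str(error))
print(rejected)
```

With `requests`, the helper would receive `response.content`, as in the cell above.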

Let's investigate the `Subresource` elements of the XML tree.

In [193]:
# The yearly record uses a different profile, so its default namespace URI differs
nsmap["dft"] = "http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997"
subresources = xml_tree_1900.xpath("//dft:Subresource", namespaces=nsmap)

for subresource in subresources:
    print_xml(subresource)

<Subresource xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997" xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" cmd:ref="archive_edm">
        <SubresourceDescription>
          <label>Archive containing full text content in EDM format which includes this title</label>
        </SubresourceDescription>
      </Subresource>
      

<Subresource xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997" xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" cmd:ref="archive_alto">
        <SubresourceDescription>
          <label>Archive containing full text content in ALTO format which includes this title</label>
        </SubresourceDescription>
      </Subresource>
      

<Subresource xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997" xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" cmd:ref="rp_3000
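
Note that the `ref` attribute itself is namespaced (`cmd:ref`), so a plain `get("ref")` does not find it. The sketch below, built on a reduced inline record with a placeholder profile URI (both assumed for the example), shows two equivalent ways to read it.

```python
from lxml import etree

CMD_NS = "http://www.clarin.eu/cmd/1"
DFT_NS = "http://example.org/profile"  # hypothetical placeholder profile URI
nsmap = {"cmd": CMD_NS, "dft": DFT_NS}

# Reduced inline record (assumed for this sketch)
root = etree.fromstring(f"""
<cmd:CMD xmlns:cmd="{CMD_NS}" xmlns="{DFT_NS}">
  <Subresource cmd:ref="archive_alto">
    <SubresourceDescription><label>ALTO archive</label></SubresourceDescription>
  </Subresource>
</cmd:CMD>
""")

subresource = root.xpath("//dft:Subresource", namespaces=nsmap)[0]

via_xpath = subresource.xpath("./@cmd:ref", namespaces=nsmap)[0]  # prefixed XPath
via_get = subresource.get(f"{{{CMD_NS}}}ref")                     # {uri}name notation
without_namespace = subresource.get("ref")                        # no such attribute

label = subresource.findtext("dft:SubresourceDescription/dft:label", namespaces=nsmap)
print(via_xpath, via_get, without_namespace, label)
```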

Let's select all issues from January 1900 and retrieve their download links.

In [194]:
# The standard-library re module is sufficient here
import re

In [213]:
date_regex = r"(?P<year>[0-9]{4})\-(?P<month>01)\-(?P<day>[0-9]{2})"


# Keep the subresources whose temporal coverage label contains the month "-01-"
january_subresources = [subresource for subresource in subresources 
                          if subresource.xpath("./dft:SubresourceDescription/dft:TemporalCoverage/dft:label[contains(text(), '-01-')]", namespaces=nsmap)]

for s in january_subresources:
    print_xml(s)

# Each Subresource points to its ResourceProxy through the namespaced cmd:ref attribute
cmd_refs = [subresource.xpath("./@cmd:ref", namespaces=nsmap)[0] for subresource in january_subresources]

# Resolve every reference to the download link stored in the matching ResourceProxy
direct_data_links = [xml_tree_1900.xpath(f"//cmd:ResourceProxy[@id='{cmd_ref}']/cmd:ResourceRef/text()", namespaces=nsmap)[0] for cmd_ref in cmd_refs]
print(direct_data_links)

<Subresource xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997" xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" cmd:ref="rp_3000059947291">
        <SubresourceDescription>
          <label>Valmieras Ziņojumu Lapa - 1900-01-15</label>
          <IdentificationInfo>
            <identifier>http://data.theeuropeanlibrary.org/BibliographicResource/3000059947291</identifier>
            <identifier>3000059947291</identifier>
          </IdentificationInfo>
          <Language>
            <name>German</name>
            <code>deu</code>
          </Language>
          <Language>
            <name>Latvian</name>
            <code>lav</code>
          </Language>
          <Language>
            <name>Russian</name>
            <code>rus</code>
          </Language>
          <TemporalCoverage>
            <label>1900-01-15</label>
            <Start>
              <date>1900-01-15</date>
            </Start>
            <En
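
The `date_regex` pattern defined in the cell above can serve as a stricter alternative to the `contains()` filter, because it validates the whole date and gives access to its parts. The sketch below applies it to a few hand-written labels (sample values assumed for the example) using the standard-library `re` module.

```python
import re

# Same pattern as in the cell above: a date whose month is fixed to January
date_regex = r"(?P<year>[0-9]{4})\-(?P<month>01)\-(?P<day>[0-9]{2})"
pattern = re.compile(date_regex)

# Hand-written sample labels (assumed for this sketch)
labels = ["1900-01-15", "1900-02-01", "1900-12-01", "1900-01-29"]

january = []
for label in labels:
    match = pattern.fullmatch(label)  # the whole label must be a January date
    if match:
        january.append((int(match["year"]), int(match["month"]), int(match["day"])))

print(january)
```

A label such as `1900-12-01` is skipped even though it contains `01`, because the pattern requires `01` in the month position.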

More concrete examples of data retrieval will follow in Chapter 2. Hopefully you now have a grasp of how the metadata links to the textual resources.